Preface



This volume contains the Proceedings of the International Symposium on Com- 
puting in Object-Oriented Parallel Environments (ISCOPE ’98), held at Santa 
Fe, New Mexico, USA on December 8-11, 1998. ISCOPE is in its second year,^ 
and continues to grow both in attendance and in the diversity of the subjects 
covered. ISCOPE’97 and its predecessor conferences focused more narrowly on 
scientific computing in the high-performance arena. ISCOPE ’98 retains this 
emphasis, but has broadened to include discrete-event simulation, mobile com- 
puting, and web-based metacomputing. 

The ISCOPE ’98 Program Committee received 39 submissions, and accep- 
ted 10 (26%) as Regular Papers, based on their excellent content, maturity of 
development, and likelihood for widespread interest. These 10 are divided into 
three technical categories. 

Applications: The first paper describes an approach to simulating advanced 
nuclear power reactor designs that incorporates multiple local solution me- 
thods and a natural extension to parallel execution. The second paper discus- 
ses a Time Warp simulation kernel that is highly configurable and portable. 
The third gives an account of the development of software for simulating 
high-intensity charged particle beams in linear particle accelerators, based 
on the POOMA framework, that shows performance considerably better 
than an HPF version, along with good parallel speedup. 

Runtime and Libraries: The first paper in this category evaluates Java as a 
language and system for high-performance numerical computing, exposing 
some issues to face in language features and compilation strategies. The se- 
cond describes using the Illinois Concert system to parallelize an adaptive 
mesh refinement code, showing that a combination of aggressive compiler 
optimizations and advanced run-time support can yield good parallel per- 
formance for dynamic applications. The third paper presents a unified fra- 
mework for building a numerical linear algebra library for dense and sparse 
matrices, achieving high performance and minimizing architectural depen- 
dencies. In the fourth paper, a parallel run-time substrate is presented that 
supports a global addressing scheme, object mobility, and automatic message 
forwarding for implementing adaptive applications on distributed-memory 
machines. 

Numerics and Algorithms: The first paper describes a software package for par- 
titioning data on structured grids, supporting inherent and new partitioning 
algorithms, and describes its use in two applications. The second describes 
a family of multiple minimum degree algorithms to generate permutations 
of large, sparse, symmetric matrices to minimize time and space required 
in matrix factorization. The final regular paper discusses two optimizing 
transformations for numerical frameworks, one that reduces inter-processor 
communication and another that improves cache utilization. 

^ The ISCOPE’97 Proceedings are available from Springer as LNCS, Vol. 1343. 
In addition, the Program Committee selected 15 submissions as Short Papers.
These papers were deemed to represent important work of a more specialized 
nature or to describe projects that are still in development. The Short Papers 
are divided into four technical categories. 

Metacomputing: The first paper presents a Java-based infrastructure to com- 
bine web-based metacomputing with cluster-based parallel computing. The 
second describes an experimental metacomputing system that is dynami- 
cally reconfigurable in its use of systems and networks, and also in its own 
capabilities. The third paper outlines a distributed platform to ease the com- 
bination of heterogeneous networks, concentrating on the design of its kernel 
software. The fourth paper presents language constructs for the simultaneous 
creation of entire static object networks which have useful properties. 

Frameworks and Run-time: The first paper describes a class library for FIFO 
queues that can be incorporated with Time Warp simulation mechanisms 
and retain the advantages of inlined data structures and efficient state sa- 
ving. The second paper presents a thread profiling system that is cognizant of 
the underlying concurrent run-time environment. The third paper evaluates 
a high-level, portable, multithreaded run-time system for supporting con- 
current object-oriented languages. The fourth describes a run-time library 
for data-parallel applications that covers a spectrum of parallel granulari- 
ties, problem regularities and user-defined data structures. The last paper 
in this section describes the design and use of a component architecture for 
large-scale simulations of scientific problems, based in turn on the POOMA 
framework. 

Numerics and Algorithms: The first paper discusses the parallelization and im- 
plementation of Monte Carlo simulations for physical problems. The second 
presents a parallel implementation of the dynamic recursion method for tri- 
diagonalizing sparse matrices efficiently. The third discusses the design of 
software for solving sparse, symmetric systems of linear equations by di- 
rect methods. The fourth paper describes a template library of two-phase 
container classes and communication primitives for parallel dynamic mesh 
applications. 

Arrays: The first of two papers in this category describes the Blitz++
library, meant to provide a base environment of vectors, arrays and matrices
for scientific computing with C++. The second discusses the design of
arrays and expression evaluation strategies in the new POOMA II framework
development.

This collection of 25 papers represents the state of the art in applying object- 
oriented methods to parallel computing. ISCOPE ’98 is truly international in 
scope, with its 72 contributing authors representing 24 research institutions in 
9 countries. The ISCOPE ’98 organizers are confident that the reader will share 
their excitement about this dynamic and important area of computer science 
and applications research. 

At the end of this volume, the Author Contacts section details the affiliations, 
postal addresses, and email addresses of all the proceedings authors. 







ISCOPE ’98 is partially supported by the Mathematical, Information, and 
Computational Sciences Division, Office of Energy Research, U.S. Department 
of Energy. 



Steering Committee 



Dennis Gannon, Indiana University 

Denis Caromel, University of Nice-INRIA Sophia Antipolis 
Yutaka Ishikawa, Real World Computing Partnership 
John Reynders, Los Alamos National Lab 
Satoshi Matsuoka, Tokyo Institute of Technology 

Jorg Nolte, German National Research Center for Information Technology 



Organizing Chairs 



Dennis Gannon, Indiana University, General Chair 

Denis Caromel, University of Nice-INRIA Sophia Antipolis, Program 

Yutaka Ishikawa, Real World Computing Partnership, Posters 

John Reynders, Los Alamos National Lab, Workshops/BOF 

Rodney R. Oldehoeft, Colorado State University, Proceedings 

Marydell Tholburn, Los Alamos National Lab, Local Arrangements/Publicity 



Program Committee 



Ole Agesen, Sun Microsystems Labs, USA 

Denis Caromel, University of Nice-INRIA Sophia Antipolis, France 
Antonio Corradi, University of Bologna, Italy 

Geoffrey Fox, Northeast Parallel Architecture Center, Syracuse Univ., USA 

Dennis Gannon, Indiana University, USA 

Jean-Marc Geib, University of Lille, France 

Andrew Grimshaw, University of Virginia, USA 

Urs Holzle, University of California-Santa Barbara, USA 

Yutaka Ishikawa, Real World Computing Partnership, Japan 

Jean-Marc Jezequel, IRISA/CNRS, France 

Pierre Kuonen, EPFL, Switzerland 

Satoshi Matsuoka, Tokyo Institute of Technology, Japan 

Jorg Nolte, Institute GMD-FIRST, Germany 

Rodney R. Oldehoeft, Colorado State University, USA 

John Reynders, Los Alamos National Lab, USA 

Wolfgang Schroeder-Preikschat, Magdeburg, GMD, Germany 

Anthony Skjellum, Mississippi State University, USA 

David F. Snelling, Fujitsu European Center for Information Technology, UK 







Kenjiro Taura, University of Tokyo, Japan 
MaryDell Tholburn, Los Alamos National Lab, USA 
Andrew L. Wendelborn, University of Adelaide, Australia 
Russel Winder, King’s College London, UK 

Additional External Referees 

Cliff Addison Sven van den Berghe Danilo Beuche 

Lars Buettner P. Calegari Peter Chow 

Paul Coddington George Crawford David Detlefs 

Rossen Dimitrov Stephane Ecolivet Antonio A. Frohlich 

Robert George Matt Gleeson Abdelaziz Guerrouat 

F. Guidec Frederic Guyomarc’h Ute Haack 

Peter Harrison Greg Henley Wai Ming Ho 

Naoki Kobayashi Evelina Lamma Pascale Launay 

Andrea Omicini Jean-Louis Pazat R. Radhakrishnan 

Tomasz Radzik Y.S. Ramakrishna C. Stefanelli 

F. Zambonelli 




Table of Contents 



Regular Papers 
Applications 

Object-Oriented Approach for an Iterative Calculation Method and Its 

Parallelization with Domain Decomposition Method 1 

Masahiro Tatsumi, Akio Yamamoto 

An Object-Oriented Time Warp Simulation Kernel 13 

Radharamanan Radhakrishnan, Dale E. Martin, Malolan Chetlur, 
Dhananjai Madhava Rao, Philip A. Wilsey 

Particle Beam Dynamics Simulations Using the POOMA Framework 25 

William Humphrey, Robert Ryne, Timothy Cleland, Julian Cummings, 
Salman Habib, Graham Mark, Ji Qiang 

Runtime and Libraries 

An Evaluation of Java for Numerical Computing 35 

Brian Blount, Siddhartha Chatterjee 

High-Level Parallel Programming of an Adaptive Mesh Application Using 

the Illinois Concert System 47 

Bishwaroop Ganguly, Andrew Chien 

The Matrix Template Library: A Generic Programming Approach to High 

Performance Numerical Linear Algebra 59 

Jeremy G. Siek, Andrew Lumsdaine 

The Mobile Object Layer: A Run-Time Substrate for Mobile Adaptive 

Computations 71 

Nikos Chrisochoides, Kevin Barker, Demian Nave, Chris Hawblitzel 

Numerics and Algorithms I 

Software Tools for Partitioning Block-Structured Applications 83 

Jarmo Rantakokko 

An Object-Oriented Collection of Minimum Degree Algorithms 95 

Gary Kumfert, Alex Pothen 

Optimizing Transformations of Stencil Operations for Parallel Object- 

Oriented Scientific Frameworks on Cache-Based Architectures 107 

Federico Bassetti, Kei Davis, Dan Quinlan 







Short Papers 
Metacomputing 

Merging Web-Based with Cluster-Based Computing 119 

Luis Moura Silva, Paulo Martins, João Gabriel Silva

Dynamic Reconfiguration and Virtual Machine Management in the 

Harness Metacomputing System 127 

Mauro Migliardi, Jack Dongarra, Al Geist, Vaidy Sunderam 

JEM-DOOS: The Java/RMI Based Distributed Objects Operating System 

of the JEM Project 135 

Serge Chaumette

Static Networks: A Powerful and Elegant Extension to Concurrent Object- 

Oriented Languages 143 

Josh Yelon, Laxmikant V. Kale 

Frameworks and Runtime 

A FIFO Queue Class Library as a State Variable of Time Warp Logical 

Processes 151 

Soichiro Hidaka, Terumasa Aoki, Hitoshi Aida, Tadao Saito 

μProfiler: Profiling User-Level Threads in a Shared-Memory Programming

Environment 159 

Peter A. Buhr, Robert Denda 

Evaluating a Multithreaded Runtime System for Concurrent 

Object-Oriented Languages 167 

Antonio J. Nebro, Ernesto Pimentel, Jose M. Troya 

Object-Oriented Run-Time Support for Data-Parallel Applications 175 

Hua Bi, Matthias Kessler, Matthias Wilhelmi 

Component Architecture of the Tecolote Framework 183 

Mark Zander, John Hall, Jim Painter, Sean O’Rourke 

Numerics and Algorithms II 

Parallel Object Oriented Monte Carlo Simulations 191 

Matthias Troyer, Beat Ammon, Elmar Heeb 

A Parallel, Object-Oriented Implementation of the Dynamic Recursion 

Method 199 

Wolfram T. Arnold, Roger Haydock 

Object-Oriented Design for Sparse Direct Solvers 207 

Florin Dobrian, Gary Kumfert, Alex Pothen 







Janus: A C++ Template Library for Parallel Dynamic Mesh Applications 215
Jens Gerlach, Mitsuhisa Sato, Yutaka Ishikawa 

Arrays 

Arrays in Blitz++ 223 

Todd L. Veldhuizen 

Array Design and Expression Evaluation in POOMA II 231 

Steve Karmesin, James Crotinger, Julian Cummings, Scott Haney, 
William Humphrey, John Reynders, Stephen Smith, Timothy Williams 

Author Contacts 239 

Author Index 243 




Object-Oriented Approach for an Iterative 
Calculation Method and Its Parallelization with 
Domain Decomposition Method 



Masahiro Tatsumi^’^* and Akio Yamamoto^ 

^ Nuclear Fuel Industries, Ltd., Osaka, Japan 
^ Osaka University, Osaka, Japan 



Abstract. With trends toward more complex nuclear reactor designs, 
advanced methods are required for appropriate reduction of design mar- 
gins from an economical point of view. As a solution, an algorithm based 
on an object-oriented approach has been developed. In this algorithm, 
calculation meshes are represented as calculation objects wherein speci- 
fic calculation algorithms are encapsulated. Abstracted data, which are 
neutron current objects, are exchanged between these objects. Calcula- 
tion objects can retrieve required data having specified data types from 
the neutron current objects, which leads to a combined use of different 
calculation methods and algorithms in the same computation. Introdu- 
cing a mechanism of object archiving and transmission has enabled a 
natural extension to a parallel algorithm. The parallel solution is identi- 
cal with the sequential one. The SCOPE code, an actual implementation 
of our algorithm, showed good performance on a networked PC cluster, 
for sufficiently coarse granularity. 



1 Introduction 

The generation of electricity by nuclear energy in Japan supplies approximately 
30% of total power generation. Nuclear power is becoming more important from 
the viewpoint of preventing the greenhouse effect, and of providing sufficient 
supply against increasing power demand. Advanced degrees of safety are funda- 
mentally required to utilize nuclear power, because of the social impact in case 
of severe accidents. At the same time, efficient design is also important from an 
economical point of view. 

Good designs can improve efficiency and save energy generation costs. Thus 
advanced and precise design tools have become very important. So far,
several kinds of approximations 0 in modeling actual reactor cores have been 
developed and adopted in order to perform calculations using computers with 
limited performance. While those models have contributed to a reduction in 
computing time while maintaining required precision, they are not suitable for 
direct use in much more complex future designs. Therefore, a more comprehen- 
sive technique in problem modeling and its application to reactor design will be 

* Also a Ph.D. candidate at the Graduate School of Engineering, Osaka University. 
needed. However, it will not be sufficient simply to perform high-quality
calculations with complex designs because of the associated high computation
cost. Consequently, an optimization technique for the calculation method
itself is needed that results in realistic computation times and
high-accuracy solutions when handling actual designs.

We have developed a new calculation technique using an object-oriented ap- 
proach that enables coupling of arbitrary calculation theories in the same calcu- 
lation stage. In this technique, an object based on a particular calculation theory 
is allocated for each calculation mesh in the finite difference method. Those cal- 
culation meshes have the same interface for data exchange with adjacent meshes, 
which enables the coupling of different calculation theories, providing an “ade- 
quate theory for any domain.” 

In the next section, analysis of reactor cores by neutronics calculations is
briefly reviewed. Techniques with an object-oriented approach and its natural
extension to the parallel algorithm are explained in Sect. 3 and 4,
respectively. Several results of a performance analysis are shown in Sect. 5.
After some discussion in Sect. 6, we conclude in Sect. 7. Finally, plans for
further study are shown in Sect. 8.

2 Reactor Core Analysis 

In a reactor core, steam is generated by thermal energy produced by controlled 
fission chain reactions. The steam is injected into turbines that produce electrical 
energy and is finally returned into the core after condensing to liquid water. 
A large portion of fission reactions is directly caused by neutrons. Therefore, 
it is quite important to estimate accurately the space and energy distribution 
of neutrons in the reactor. However, the size of a neutron is extremely small, 
compared to that of the reactor, with a wide range of energy distributions. This 
makes a microscopic analysis by handling neutrons explicitly very difficult. So 
an approximation is needed. 

In the next two subsections, the computation model of reactor core analysis, 
and solution methods to solve the neutron transport equation in the reactor core 
are described briefly. 



2.1 Computation Model 

In a typical analysis technique, three modeling stages are assumed for space and 
energy in interactions of neutrons and materials: microscopic, mesoscopic and 
macroscopic stages. Such models are widely used in other kinds of analysis, and 
boundaries among the three stages may be obscured with improvement in the 
calculation methods. 

Our aim is to perform more precise calculations by integrating the mesoscopic
and macroscopic stages in reactor core analysis. A straightforward approach
would require too much computation time and is infeasible for actual
near-future designs. Therefore we developed a new calculation technique that
can shorten computation time with minimum degradation in accuracy by changing
the solution method of the governing equation according to the importance of
the domain. In the next section, typical solution methods for the neutron
transport equation, the governing equation in reactor core analysis, will be
described.



2.2 Solution Methods of Neutron Transport Equation 

The behavior of neutrons is described with the Boltzmann equation, but this is 
difficult to solve without discretization of energy. Normally, it is treated as a set 
of simultaneous equations of many groups after integrating within appropriate 
energy ranges. Our interest is to solve the discretized transport equations and 
obtain the space and energy distribution of neutron and fission under various 
conditions. 

Several methods can be utilized to solve the discretized transport equations
[2, 3]. For example, phase space can be divided along the flight direction of
a neutron in the discrete ordinate method (S_N method). The Legendre expansion
method (P_L method) expresses the angular distribution with polynomial
functions. These methods show good accuracy, but their computation costs are
quite high. On the other hand, the diffusion approximation method treats only
a representative direction, which is a good approximation in general and
requires less computation cost. However, accuracy worsens when there is a
large gradient in the neutron distribution, because anisotropy in the neutron
flight direction is not accounted for. As mentioned, neutron currents are
discretized in angles for the S_N method and in moments for the P_L method.
Under those circumstances, how can we be efficient when calculation meshes
with different theories for solution methods reside in adjacent positions? In
such a situation, one solution is based on an object-oriented approach.



3 Object-Oriented Approach for Sequential Method 

In an object-oriented approach, a system is built using “objects” as if they
were bricks that unify data and procedures. These objects are highly
independent of each other, so direct reference to the internal data of other
objects is not allowed in this approach. In other words, one must call
procedures in an object to set or retrieve data from the object. This
apparently troublesome limitation protects the integrity of the model, which
leads to high modularity, extensibility and reusability.

First, we assume an implementation of the object-oriented approach to a 
meshing method such as the finite difference method (FDM). In the iterative 
FDM, successive calculations are performed until all the governing equations are 
satisfied. Variables in a calculation mesh are determined to satisfy the governing 
equation locally. 

Second, we consider each calculation mesh as an “object.” Each calculation
node object has the parameters necessary for calculation within the node. The
governing equations are solved locally in the object by a solution method
procedure of its own. Information to be exchanged between calculation objects
is carried by an abstract object, which in our application is the neutron
current object. Calculation objects can retrieve information from the neutron
current objects with a specified data type through the mechanism of automatic
type conversion that is built into the neutron current object. Therefore, a
calculation object does not need to know what kinds of calculation objects
exist in adjacent positions. In this manner, various kinds of calculation
methods can be assigned at arbitrary mesh points in the domain. Figure 1
illustrates this object-oriented approach.

[Figure: a calculation node object based on neutron diffusion theory, which
requires one-direction neutron current data, connected through a neutron
current object to a calculation node object based on neutron transport theory,
which works with N-direction data. Annotation: a kind of approximation may be
needed in the conversion.]

Fig. 1. A neutron current object that has automatic type conversion inside can
connect different types of calculation objects that represent calculation
meshes
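The type conversion described above might look roughly like the following C++
sketch. It is an illustration only: the class name NeutronCurrent, the member
functions asScalar() and asDirections(), and both conversion rules (a plain sum
and an even, isotropic spread) are assumptions made here, not the SCOPE
implementation.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical neutron current object: it stores directional components and
// hands them out in whatever form the requesting calculation node needs.
class NeutronCurrent {
public:
    explicit NeutronCurrent(std::vector<double> components)
        : components_(std::move(components)) {}

    // Form for a diffusion-type node: a single value.
    // Assumption: the conversion is a plain sum of the stored components.
    double asScalar() const {
        return std::accumulate(components_.begin(), components_.end(), 0.0);
    }

    // Form for a transport-type node: n directional values.
    // Assumption: if the stored resolution differs from n, the total is
    // spread evenly over the n directions (this is where "a kind of
    // approximation" enters).
    std::vector<double> asDirections(std::size_t n) const {
        if (n == 0) return {};
        if (n == components_.size()) return components_;
        return std::vector<double>(n, asScalar() / static_cast<double>(n));
    }

private:
    std::vector<double> components_;
};
```

A diffusion node would call asScalar() and a transport node asDirections(n), so
neither needs to know which kind of node produced the current.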

Each calculation node class has a specific procedure based on its calculation
method, and the interface to the procedure is the same for all classes, for
instance “calc().” Therefore a new class based on another calculation method
can be derived from the base calculation class, which holds the common
properties and procedures, by defining the method-specific properties and
overriding the “calc()” member function. Some classes of calculation node are
listed in Table 1.
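As a minimal sketch of this class structure, the C++ fragment below derives
two node types from a common base and calls them through the shared calc()
interface. The class names follow Table 1; the flux_ member, the arithmetic
inside calc(), and the helper calcAll() are placeholders invented for this
example and carry no physics.

```cpp
#include <memory>
#include <vector>

// Abstract superclass: every node exposes the same calc() interface.
class Node {
public:
    virtual ~Node() = default;
    virtual void calc() = 0;
};

// Basic calculation node holding method-independent state.
class CNode : public Node {
public:
    double flux() const { return flux_; }
protected:
    double flux_ = 1.0;  // illustrative property, not from the paper
};

// Each derived class only supplies its own calc() body.
class FDDNode : public CNode {
public:
    void calc() override { flux_ += 1.0; }  // placeholder for diffusion
};

class SP3Node : public CNode {
public:
    void calc() override { flux_ += 2.0; }  // placeholder for Simplified P3
};

// The caller sees only the common interface, never the concrete type.
inline void calcAll(const std::vector<std::unique_ptr<CNode>>& nodes) {
    for (const auto& node : nodes) node->calc();
}
```

Adding another calculation method then means adding one more derived class;
calcAll() and any other code written against the base class stay unchanged.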

These calculation nodes are stored and managed in a container object, which
is built by the Container Controller object using the Region object. The
container object represents the whole or decomposed domain of the system,
while each calculation object in the container object represents a
calculation mesh. Note that calculation node objects in the container are
independent of each other and can be dynamically replaced by other types of
calculation nodes (for example, with a higher-order approximation) at any
stage during the iterative calculations. In this way, a quite flexible system
can be built easily with the object-oriented approach. For example, a
calculation method/algorithm can be upgraded by region and by time. The object
classes that are used for construction of the calculation domain are listed in
Table 2.
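The container idea can be sketched as follows, assuming a simple
three-dimensional grid of node pointers. The red/black checkerboard sweep is
named in the paper's class description; everything else here (the replace()
and at() members, the index layout, the CheapNode and PreciseNode stand-ins
and their call counters) is hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Node {                       // minimal stand-in for the node superclass
    virtual ~Node() = default;
    virtual void calc() = 0;
    int work = 0;                   // illustrative counter, not in the paper
};
struct CheapNode : Node { void calc() override { work += 1; } };
struct PreciseNode : Node { void calc() override { work += 10; } };

class Container {
public:
    Container(std::size_t nx, std::size_t ny, std::size_t nz)
        : nx_(nx), ny_(ny), nz_(nz), nodes_(nx * ny * nz) {
        for (auto& n : nodes_) n = std::make_unique<CheapNode>();
    }

    // Swap in a different node type at one mesh point, at any time.
    void replace(std::size_t i, std::size_t j, std::size_t k,
                 std::unique_ptr<Node> node) {
        nodes_[index(i, j, k)] = std::move(node);
    }

    Node& at(std::size_t i, std::size_t j, std::size_t k) {
        return *nodes_[index(i, j, k)];
    }

    // One red/black sweep: meshes with even i+j+k first, then odd ones.
    void sweep() {
        for (std::size_t color = 0; color < 2; ++color)
            for (std::size_t k = 0; k < nz_; ++k)
                for (std::size_t j = 0; j < ny_; ++j)
                    for (std::size_t i = 0; i < nx_; ++i)
                        if ((i + j + k) % 2 == color)
                            nodes_[index(i, j, k)]->calc();
    }

private:
    std::size_t index(std::size_t i, std::size_t j, std::size_t k) const {
        return (k * ny_ + j) * nx_ + i;
    }
    std::size_t nx_, ny_, nz_;
    std::vector<std::unique_ptr<Node>> nodes_;
};
```

Because the sweep only calls calc(), replacing a node between iterations needs
no change to the sweep itself.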







Table 1. List of currently available classes of calculation objects

Class Name    Description
----------    -----------
Node          Abstract superclass for all kinds of nodes
BNode         Node on system and processor boundary, derived from the
              Node class
CNode         Basic calculation node with method-independent properties and
              procedures, serving as a superclass for each specific
              calculation node. This class is derived from the Node class.
FDDNode       Calculation node based on the finite difference diffusion
              approximation (derived from CNode)
SP3Node       Calculation node based on the Simplified P3 transport
              approximation (derived from CNode)
...and more   There are several derived classes of calculation node for each
              specific calculation method/algorithm.



Table 2. Object classes for constructing calculation domain

Class Name             Description
----------             -----------
Region                 Abstracted representation of geometric configuration
                       of calculation domain, such as boundary conditions,
                       material map, calculation node map, etc.
Container              Abstracted calculation domain that keeps nodes inside,
                       performs three-dimensional Red/Black checkerboard
                       sweep, etc.
Container Controller   Controlling object for Container.



4 Natural Extension to a Parallel Algorithm 



With a finer model of computation, the growth in demand for computing re- 
sources becomes a significant problem. As a result, attention has been paid to 
parallel and distributed computing. In this section a natural extension of the 
above object-oriented approach to a parallel and distributed computing environ- 
ment is described. 

A merit of the object-oriented approach is independence among objects by en- 
capsulating information in them. This allows concurrency among objects because 
object interaction is minimized by the nature of the object-oriented approach. 

Parallel computing on a distributed-memory system requires inter-processor 
communication over a network. Message passing libraries such as MPIpQE] and 
PVMjOj are provided for high performance computing that supports transmis- 
sion of arrays of basic data types (int, double, etc) and derived data types such 
as vectors and structures. However a higher data abstraction by object archi- 
ving and transmission is not supported. Therefore we introduced a mechanism 
to encode and decode objects to be transferred. With this extension, objects 
are virtually transmitted between processors with great security of data inside 
objects. 
In our approach, assignment of portions of a problem to each processor is
performed by domain decomposition 0. Each processor has an object defined by 
the Region class in which all the information is encapsulated to map an actual 
problem domain into calculation objects. In parallel computing, for instance, 
each processor in a group has data for only a part of the total system and com- 
putes in it as its responsible domain. In the Region object, boundary conditions 
and some information for parallel computing (such as processor index for each 
direction) are also defined. 

In domain decomposition, a quasi-master processor divides a total system 
into portions of the domain, produces Region objects for them, and distributes 
them to client processors. Each processor receives a Region object and constructs 
a responsible domain. Note that parallelization of serial code can be done quite 
naturally with relatively small changes, because the fundamental data structures 
do not change owing to data encapsulation by objects. Additional work was 
needed for only a few classes: Region and RegionManager classes, and ObjectPass 
for domain decomposition and object transmission, respectively. A new class, 
CurrentVector, for packing and unpacking of Current objects was also introduced 
for efficient data transmission between processors. 
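One way to picture the packing step is to flatten several current objects
into a single contiguous buffer of doubles, which a message passing library
can then send as one array of a basic data type. The sketch below is an
assumption-laden illustration: the Current struct, the free functions pack()
and unpack(), and the buffer layout are all invented here and are not the
CurrentVector class of the SCOPE code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a neutron current object.
struct Current {
    std::vector<double> components;
};

// Assumed layout: [count, n_0, c_0..., n_1, c_1..., ...], all as double,
// so the whole message is one flat array.
std::vector<double> pack(const std::vector<Current>& currents) {
    std::vector<double> buf;
    buf.push_back(static_cast<double>(currents.size()));
    for (const auto& c : currents) {
        buf.push_back(static_cast<double>(c.components.size()));
        buf.insert(buf.end(), c.components.begin(), c.components.end());
    }
    return buf;
}

// Rebuild the objects on the receiving side from the flat buffer.
std::vector<Current> unpack(const std::vector<double>& buf) {
    std::vector<Current> currents;
    if (buf.empty()) return currents;
    std::size_t pos = 0;
    const auto count = static_cast<std::size_t>(buf[pos++]);
    for (std::size_t i = 0; i < count; ++i) {
        const auto n = static_cast<std::size_t>(buf[pos++]);
        const auto first = buf.begin() + static_cast<std::ptrdiff_t>(pos);
        const auto last = first + static_cast<std::ptrdiff_t>(n);
        currents.push_back(Current{std::vector<double>(first, last)});
        pos += n;
    }
    return currents;
}
```

Sending one packed buffer per neighbor instead of one message per object keeps
the number of messages small, which matters on a network with high latency.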

Figure 2 shows domain decomposition using the Region object for parallel
computing with two processors. The only difference between parallel and serial
computing is whether there are pseudo-boundaries between processors where
communication is needed. Consequently, the parallel version follows exactly the
same process of convergence as the serial version in an iterative calculation.
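The role of the pseudo-boundaries can be illustrated with a one-dimensional
decomposition. In this sketch the Region struct, its fields, and the function
decompose() are hypothetical simplifications; the paper's Region class also
carries boundary conditions, material maps and node maps.

```cpp
#include <vector>

// Hypothetical, simplified region: a half-open range of mesh planes plus
// flags telling whether each end is a pseudo-boundary shared with a
// neighboring processor (true) or a real system boundary (false).
struct Region {
    int begin;
    int end;
    bool pseudoLow;
    bool pseudoHigh;
};

// Split meshCount planes as evenly as possible among the processors.
std::vector<Region> decompose(int meshCount, int processors) {
    std::vector<Region> regions;
    if (meshCount <= 0 || processors <= 0) return regions;
    const int base = meshCount / processors;
    const int extra = meshCount % processors;
    int begin = 0;
    for (int p = 0; p < processors; ++p) {
        const int size = base + (p < extra ? 1 : 0);
        regions.push_back({begin, begin + size, p > 0, p < processors - 1});
        begin += size;
    }
    return regions;
}
```

With one processor the single region has no pseudo-boundaries, which is the
serial case; with two processors each region gains exactly one.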



[Figure legend. Dif: a node based on diffusion theory; Tr: a node based on
transport theory; •: neutron-current object. Panel label: In Serial
Computing.]

Fig. 2. Parallel computing with two processors by domain decomposition

5 Performance Analysis

A parallel solution algorithm for the neutron transport equation based on this 
approach has been implemented in C++ as the SCOPE code, currently under
development. In the SCOPE code, several kinds of calculation nodes with dif- 
ferent bases of calculation theory can be used and assigned to arbitrary meshes 
in the domain. All the classes of calculation nodes currently available in the 
SCOPE code are listed in Table 3. MPICHjO| was used as the message passing
library for communication among processors. 

The parallel performance of the SCOPE code was measured on a networked
cluster of PCs connected with an Ethernet switch. The cluster consists of a
server node and client nodes as described in Table 4.

The server node provides services such as NFS so that the client nodes can
mount home directories. This parallel environment can be considered as
homogeneous because the actual computation is performed only on the client
nodes. The effective performance of peer-to-peer communication between client
nodes was reasonable: measurement of file transfer rates by rcp gave a network
bandwidth of about 900 KB/s.



Table 3. List of currently available classes of calculation objects

Class Name   Calculation Theory      Number of data in a      Relative
                                     neutron current object   Computation Load
----------   ------------------      ----------------------   ----------------
FDDNode      Diffusion Theory         1                         1
SP2Node      Simplified P2            1                         1.2
SP3Node      Simplified P3            2                         2
SP5Node      Simplified P5            3                         3
ANMNode      Analytical Nodal         1                        10
             Polynomial Expansion
S4Node       Discrete Ordinate S4    12                        12
S6Node       Discrete Ordinate S6    24                        24
S8Node       Discrete Ordinate S8    40                        40



Table 4. System configuration of the networked PC cluster

                         Server Node             Client Node
                         -----------             -----------
Model                    Dell OptiPlex Gxa 333   Compaq DeskPro 590
Operating System         Linux 2.0.33            Linux 2.0.33
CPU                      Pentium II 333MHz       Pentium 90MHz
RAM                      160MB                   32-128MB
Network Interface Card   3Com 950                PCNet32

Table 5. Problems analyzed on networked PC cluster

Case Name   Mesh Size      Node Type
---------   ---------      ---------
Med         60 x 60 x 6    FDDNode
Big         60 x 60 x 18   FDDNode
Med-sp3     60 x 60 x 6    SP3Node
Big-sp3     60 x 60 x 18   SP3Node



Performance measurement was done with the four problems listed in Table 5,
changing the problem size and calculation node type. Speedup curves are shown
in Fig. 3 for each case.

When FDDNode is used in a small problem (Fig. 3-(a)), performance rapidly
degrades with the number of processors because of the finer granularity.
Problem “Big” is three times larger than “Med” in the z-axis, which improves
performance (Fig. 3-(b)). SP3Node, with a computational load about twice that
of FDDNode, showed better performance even in the “Med” case (Fig. 3-(c)).
However, the performance on eight processors worsened because of the 3D domain
decomposition, which requires additional initialization and smaller packets
for communication as compared to 2D or 1D decompositions. In this case, the 3D
domain did not have enough computation load compared to communication load
[10]. The last case, using SP3Node on “Big,” showed good performance on eight
processors (Fig. 3-(d)).

Figure 4 predicts efficiency as a function of the number of processors when
2D or 3D decomposition is performed. The curve in the figure gives efficiency
relative to perfect speedup. If 50% is set as the criterion for parallel
efficiency, one can roughly estimate that about 18 processors can be used for
the problem. So good performance can be expected by using 18 or fewer
processors in the calculation.
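The 50% criterion can be written out with the usual definitions of speedup and
parallel efficiency. The symbols T_1 (run time on one processor), T_p (run time
on p processors), S and E are notation introduced here for illustration; the
paper does not define them explicitly.

```latex
S(p) = \frac{T_1}{T_p}, \qquad
E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}, \qquad
E(p) \ge 0.5 \iff S(p) \ge \frac{p}{2}
```

Under these definitions, an efficiency of 50% at about 18 processors
corresponds to a speedup of about 9.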



6 Discussion 

Good performance and shorter computation time can be expected with parallel
computing on a networked PC cluster when the problem to be solved has large
enough granularity. Another approach is to use several kinds of calculation
node objects in the same computation. For instance, one may assign more
accurate calculation node objects to important areas and approximate
calculation node objects to less important areas in the domain. This approach
can reduce computation time greatly and maintain good accuracy overall.

As an experiment with this approach, we performed some benchmarks [11].
Three kinds of node configurations were examined: all FDDNodes, all SP3Nodes,
and a hybrid of SP3Nodes and FDDNodes. In the last case, SP3Nodes were
assigned to important regions (such as fuel and control rods), while FDDNodes
were used for less important regions such as peripheral moderators. Calculation
results and relative computation times are listed in Table 6. The hybrid
configuration reduced computation time compared to the all-SP3Node case while
giving the same accuracy, measured by the eigenvalue.

Fig. 3. Speedup curves for the SCOPE code on the networked PC cluster

Table 6. Impact of hybrid use of calculation nodes in a benchmark problem

Case       Eigenvalue   Error (%)   Relative Comp. Time
FDDNode    0.9318       -2.97       1.00
SP3Node    0.9590       -0.14       1.38
Hybrid     0.9590       -0.14       1.11

7 Conclusions 

A solution algorithm for an iterative calculation based on an object-oriented 
approach has been developed. The object-oriented approach provides flexibility 
and extensibility at the same time, and enhances reusability and maintainability. 

A natural extension to a parallel algorithm has also been studied. The object
classes introduced for domain decomposition and object transmission allow a
natural parallelization with minimal changes to the serial version. Exactly the
same convergence properties can be expected for all processor configurations.
High performance can be obtained for large granularity, even on a networked
PC cluster.

The hybrid use of several types of calculation node objects also reduces
computation time with minimal degradation of accuracy. This approach is quite
attractive, and further studies that introduce multigrid analysis are expected.

8 Future Study 

Further investigation of parallel performance will be continued using other kinds
of calculation nodes, such as ANMNode and SnNode. They have heavier computation
requirements than SP3Node, so better parallel speedup can be expected.
It is also important to investigate the hybrid use of calculation nodes
requiring different current types, and Sn, for its accuracy and parallel per- 
formance. Furthermore, we will study the dynamic replacement of calculation 
nodes from the viewpoint of reducing total computation time and maintaining 
the quality of solutions. 
Abstract. The design of a Time Warp simulation kernel is made difficult by the
inherent complexity of the paradigm. Hence it becomes critical that the design
of such complex simulation kernels follow established design principles such as
object-oriented design so that the implementation is simple to modify and
extend. In this paper, we present a compendium of our efforts in the design and
development of an object-oriented Time Warp simulation kernel, called warped.
warped is a publicly available Time Warp simulation kernel for experimentation
and application development. The kernel defines a standard interface to the
application developer and is designed to provide a highly configurable
environment for the integration of Time Warp optimizations. It is written in
C++, uses the MPI message passing standard for communication, and executes on a
variety of platforms including a network of SUN workstations, a SUN SMP
workstation, the IBM SP1/SP2 multiprocessors, the Cray T3E, the Intel Paragon,
and IBM-compatible PCs running Linux.



1 Introduction 

The Time Warp parallel synchronization protocol has been the topic of research 
for a number of years, and many modifications/optimizations have been propo- 
sed and analyzed PH However, these investigations are generally conducted in 
distinct environments with each optimization re-implemented for comparative 
analysis. Besides the obvious waste of manpower to re-implement Time Warp 
and its affiliated optimizations, the possibility for a varying quality of the im- 
plemented optimizations exists. 

The WARPED project is an attempt to make a freely available object-oriented 
Time Warp simulation kernel that is easily ported, simple to modify and extend, 
and readily attached to new applications. The primary goal of this project is to 
release an object-oriented software system that is freely available to the research 
community for analysis of the Time Warp design space. In order to make warped 
useful, the system must be easy to obtain, available with running applications, 
operational on several processing platforms, and easy to install, port, and extend. 

This paper describes the general structure of the warped kernel and presents
a compendium of the object-oriented design issues and problems that had to be
solved. In addition, two distinct application domains for warped are
described. The warped system is implemented as a set of libraries from which
the user builds simulation objects. The warped kernel uses the MPI portable
message passing interface and has been ported to several architectures,
including the IBM SP1/SP2, the Cray T3E, the Intel Paragon, a network of SUN
workstations, an SMP SUN workstation, and a network of Pentium Pro PCs running
Linux.

* Support for this work was provided in part by the Advanced Research Projects

The WARPED system is implemented in C++ and utilizes the object-oriented
capabilities of the language. Even if one is interested in warped only at the 
system interface level, they must understand concepts such as inheritance, vir- 
tual functions, and overloading. The benefit of this type of design is that the 
end user can redefine and reconfigure functions without directly changing ker- 
nel code. Any system function can be overloaded to fit the user’s needs and 
any basic system structure can be redefined. This capability allows the user to 
easily modify the system queues, algorithms or any part of the simulation ker- 
nel. This flexibility makes the warped system a powerful tool for Time Warp 
experimentation. 







Another benefit of the object-oriented nature of the warped application
interface is that, by its very design, it is simple to “plug in” a different
kernel. A sequential simulation kernel is supplied in the warped distribution
in addition to the Time Warp kernel. Version 0.9 of warped is available on the
web at http://www.ece.uc.edu/~paw/warped/. The remainder of this paper is
organized as follows. Section 2 presents a description of the Time Warp
paradigm. Section 3 details the warped kernel’s application/kernel interface
and presents a compendium of the design issues that had to be solved for the
development of the WARPED system. Section 4 demonstrates, through two examples,
the construction of simulation applications using the warped kernel. Finally,
Section 5 contains some concluding remarks.
2 Background

In a Time Warp synchronized discrete event simulation, Virtual Time [2] is used
to model the passage of time in the simulation. The virtual time defines a
total order on the events of the system. The simulation state (and time)
advances in discrete steps as each event is processed. The simulation is
executed via several simulator processes, called simulation objects or logical
processes (LP). Each LP is constructed from a physical process (PP) and three
history queues. Figure 1 illustrates the structure of an LP. The input and the
output queues store incoming and outgoing events respectively. The state queue
stores the state history of the LP. Each LP maintains a clock that records its
Local Virtual Time (LVT). LPs interact with each other by exchanging
time-stamped event messages. Changes in the state of the simulation occur as
events are processed at specific virtual times. In turn, events may schedule
other events at future virtual times.

The LPs must be synchronized in order to maintain the causality of the 
simulation; although each LP processes local events in their (locally) correct 
time-stamp order, events are not globally ordered. Fortunately, each event need 
only be ordered with respect to events that affect it (and, conversely, events 
that it affects); hence, only a partial order of the events is necessary for correct 
execution. Under optimistically synchronized protocols (e.g., the Time Warp
model [2]), LPs execute their local simulation autonomously, without explicit
synchronization. A causality error arises if an LP receives a message with a
time-stamp earlier than its LVT (a straggler message). In order to allow recovery,
the state of the LP and the output events generated are saved in history queues 
as events are processed. When a straggler message is detected, the erroneous 
computation must be undone — a rollback occurs. The rollback process consists 
of the following steps: the state of the LP is restored to a state prior to the 
straggler message’s time-stamp, and then erroneously sent output messages are 
canceled (by sending anti-messages to nullify the original messages). The global
progress time of the simulation, called Global Virtual Time (GVT), is defined 
as the time of the earliest unprocessed message in the system 0 |^ . Periodic 
GVT calculation is performed to reclaim memory space as history items with a 
time-stamp lower than GVT are no longer needed, and can be deleted to make 
room for new history items. 
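
The rollback steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the warped implementation: virtual time is a plain integer, the application state is a single number, every processed event schedules exactly one output event, and the names Event, SavedState, and LogicalProcess are hypothetical. Anti-messages are returned to the caller rather than sent, and re-execution of rolled-back input events is omitted.

```cpp
#include <cassert>
#include <vector>

// Time-stamped message exchanged between logical processes (LPs).
struct Event {
    int sendTime;      // sender's LVT when the event was generated
    int receiveTime;   // virtual time at which the receiver processes it
    bool antiMessage;  // true if this message cancels an earlier one
};

// One entry of the state queue.
struct SavedState {
    int time;    // LVT at which this state was saved
    long value;  // application state, reduced to one number here
};

class LogicalProcess {
public:
    LogicalProcess() { stateQueue_.push_back({0, 0}); }

    int lvt() const { return lvt_; }
    long value() const { return stateQueue_.back().value; }

    // Process an incoming event optimistically. Returns the anti-messages
    // produced by a rollback (empty if the event arrived in order).
    std::vector<Event> receive(const Event& e, long delta) {
        std::vector<Event> antiMessages;
        if (e.receiveTime < lvt_) {  // straggler: causality error
            antiMessages = rollback(e.receiveTime);
        }
        lvt_ = e.receiveTime;
        stateQueue_.push_back({lvt_, value() + delta});   // save state history
        outputQueue_.push_back({lvt_, lvt_ + 1, false});  // save output history
        return antiMessages;
    }

private:
    std::vector<Event> rollback(int stragglerTime) {
        // Step 1: restore the latest state saved before the straggler's time.
        while (stateQueue_.size() > 1 &&
               stateQueue_.back().time >= stragglerTime) {
            stateQueue_.pop_back();
        }
        lvt_ = stateQueue_.back().time;

        // Step 2: cancel output generated at or after the straggler's time.
        std::vector<Event> antiMessages;
        while (!outputQueue_.empty() &&
               outputQueue_.back().sendTime >= stragglerTime) {
            Event anti = outputQueue_.back();
            anti.antiMessage = true;
            antiMessages.push_back(anti);
            outputQueue_.pop_back();
        }
        return antiMessages;
    }

    int lvt_ = 0;
    std::vector<SavedState> stateQueue_;
    std::vector<Event> outputQueue_;
};
```

In a complete kernel the input queue would also be kept, so that events processed after the straggler's time-stamp can be re-executed once the state has been restored.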



3 The WARPED Application and Kernel Interface 

The WARPED kernel presents an interface to the application for building logical
processes based on Jefferson’s original definition [2] of Time Warp. Logical
processes (LPs) are modeled as entities which send and receive events to and from
each other, and act on these events by applying them to their internal state. 
This being the case, basic functions that the kernel provides to the application 
are methods for sending and receiving events between LPs, and the ability to 
specify different types of LPs with unique definitions of state. One departure 
Application Interface

  class UserLogicalProcess
    Kernel provides:
      void sendEvent(UserEvent *);
      bool haveMoreEvents();
      UserEvent *getEvent();
      VTime *getSimulationTime();
      UserState *getState();
    Application provides:
      void initialize();
      void finalize();
      void executeProcess();
      UserState *allocateState();
      void deallocateState(UserState *);

  class UserState
      void copyState(UserState *);

  class UserEvent
    Kernel provides:
      void setReceiver(int);
      void setReceiveTime(VTime *);
    Application provides:
      SerializedInstance *serialize();
      UserEvent *deserialize(SerializedInstance *);

Kernel Interface

  class KernelLogicalProcessInterface
      void initialize();
      void finalize();
      void executeProcess();
      KernelState *allocateState();
      void deallocateState(KernelState *);

  class KernelState
      void copyState(KernelState *);

  class KernelEvent
      int getReceiver();
      VTime *getReceiveTime();

  class KernelObject
      void receiveUserEvent(KernelEvent *);
      bool hasMoreEvents();
      KernelEvent *giveCurrentEvent();
      KernelState *getState();

Fig. 2. Application and kernel interface



from Jefferson’s presentation of Time Warp is that LPs are placed into groups 
called “clusters”. LPs on the same cluster communicate with each other without
the intervention of the message system, which is much faster than communi- 
cation through the message system. Hence, LPs which communicate frequently 
should be placed on the same cluster. Another feature of the cluster is that it is 
responsible for scheduling the LPs. Note that the LPs within a cluster operate 
as Time Warp processes; even though they are grouped together, they aren’t 
coerced into synchronizing with each other. 

Control is passed between the application and the kernel through the coo- 
perative use of function calls. This means that when a function is called in 
application code, the application is not allowed to block for any reason. Since 
the application has control of the single thread of control through its cluster, it 
could end up waiting forever. In order for the kernel to correctly interact with 
the application code, the user must provide several functions to the kernel. These 
functions define such things as how to initialize each LP, and what each LP does 
during a simulation cycle. In addition, if the user would like to use a non-standard 
definition of time, facilities are in place to provide a user-defined time class to 
the kernel. By default, warped has a simple notion of time. More precisely, time 
is defined in the class VTime as a signed integer. Obviously, particular instances 
of an application may have different requirements for the concept of time. For 
example, simulators for the hardware description language VHDL Pj require a 
more complex definition of time. If the simple, kernel-supplied version of time 
is not sufficient, the application programmer must define the class VTime with 
data members appropriate to the application’s needs. In addition, the user must 



An Object-Oriented Time Warp Simulation Kernel 






use the preprocessor macro USE_USER_VTIME during compilation. The warped
kernel also has requirements about the methods defined for the type VTime.
Specifically, the implementation of VTime must supply the following operators
and data, either by default or through explicit instantiation:

- Assignment (=), addition (+), and subtraction (-) operators.
- The relational operators: ==, !=, >=, <=, >, <.
- Constant objects ZERO, PINFINITY, and INVALID_VTIME of type VTime, which
  define, respectively, the smallest, largest, and invalid time values.
- INVALID_VTIME must be less than ZERO.
- The insertion operator (<<) for class ostream, for type VTime.
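
As an illustration of these requirements, a user-defined VTime could look like the sketch below. The two-component representation (a time value plus a delta count, loosely modeled on VHDL delta cycles) and the member names are assumptions made for this example; only the required operators and the constant names come from the list above.

```cpp
#include <climits>
#include <ostream>

// Hypothetical two-component virtual time: a time value plus a delta count.
// Assignment (=) is the compiler-generated default.
class VTime {
public:
    VTime(long t = 0, int d = 0) : time_(t), delta_(d) {}

    VTime operator+(const VTime& rhs) const {
        return VTime(time_ + rhs.time_, delta_ + rhs.delta_);
    }
    VTime operator-(const VTime& rhs) const {
        return VTime(time_ - rhs.time_, delta_ - rhs.delta_);
    }

    bool operator==(const VTime& rhs) const {
        return time_ == rhs.time_ && delta_ == rhs.delta_;
    }
    bool operator!=(const VTime& rhs) const { return !(*this == rhs); }

    // Lexicographic order: compare the time first, then the delta count.
    bool operator<(const VTime& rhs) const {
        return time_ < rhs.time_ ||
               (time_ == rhs.time_ && delta_ < rhs.delta_);
    }
    bool operator>(const VTime& rhs) const { return rhs < *this; }
    bool operator<=(const VTime& rhs) const { return !(rhs < *this); }
    bool operator>=(const VTime& rhs) const { return !(*this < rhs); }

    friend std::ostream& operator<<(std::ostream& os, const VTime& t) {
        return os << '(' << t.time_ << ", " << t.delta_ << ')';
    }

private:
    long time_;
    int delta_;
};

// Required constants; INVALID_VTIME compares less than ZERO.
const VTime ZERO(0, 0);
const VTime PINFINITY(LONG_MAX, INT_MAX);
const VTime INVALID_VTIME(-1, 0);
```

Defining all six relational operators in terms of == and < keeps the ordering consistent, which matters because the kernel's queues presumably depend on it.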

The application interface is implemented through the object-oriented features 
of the C++ language. The simulation kernel is built from several classes, allowing
the user to define a system configuration by specifying the classes to use, without 
rewriting system code. Application specific code is derived from the warped 
kernel. This allows application code to transparently access kernel functions and 
is restrictive enough to hide communication and Time Warp details from the 
user. This section describes what is necessary for an application writer to provide 
the WARPED kernel, and what the simulation kernel provides to the application 
in return. To use the warped kernel, the application programmer must provide 
three class definitions corresponding to the logical process (LP), the notion of 
state for that LP, and a definition (or definitions) for events. 

LPs form the core of the discrete event simulation. An LP represents an entity
that can send/receive events to/from other LPs. As a result of these events,
changes are made to the LP’s internal state (and output may result). Figure 2
illustrates the application and the kernel interfaces presented by the warped
system. The interface as seen by a user’s LP is represented by the
UserLogicalProcess class definition. The class definition is divided into two
parts. The first part is the set of methods that the kernel provides to the LP
for communication (sendEvent, getEvent), for querying the kernel for
information (haveMoreEvents, getSimulationTime), and for accessing its state
(getState). In addition to these methods, there are some internal methods that
the kernel calls periodically. These include message polling primitives to
check for the arrival of messages from remote processors and garbage collection
primitives. The second part consists of a set of methods that the application
writer overrides. The kernel calls these methods at various times throughout
the simulation. Each method in this set has a specific function. The initialize
method is called on each LP before the simulation begins. This gives each LP
a chance to perform any actions required for initialization. For example,
initialization might include opening files, setting up the initial state of an
LP, or transmitting initial setup events to the distributed processes in the
simulation. Conversely, the method finalize is called after the simulation has
ended. This allows the LPs to “clean up” after themselves by performing actions
such as closing files, computing statistics, and producing output. The method
executeProcess of an LP is called by the kernel whenever the LP has at least
one event to process. The kernel calls allocateState in an LP when it needs the
LP to allocate a state on its behalf. deallocateState is called by the kernel
to hand back a state to the application when it is done with it. At this point,
the application may deallocate it, or store it for later use.
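
The division of labor described above can be illustrated with a small sketch. The base class LogicalProcessBase below is a stand-in written for this example and is not the warped header; the method names follow Fig. 2, but the signatures are simplified (events are passed by value), and the deliver method and the Accumulator class are hypothetical.

```cpp
#include <queue>

// Stand-ins for the kernel types; the real warped declarations may differ.
struct UserEvent {
    int receiver;
    int receiveTime;
    int payload;
};

struct UserState {
    long sum = 0;
    int eventsSeen = 0;
};

// Minimal stand-in for the kernel side of the LP interface.
class LogicalProcessBase {
public:
    virtual ~LogicalProcessBase() {}

    // Application provides:
    virtual void initialize() = 0;
    virtual void executeProcess() = 0;
    virtual void finalize() = 0;

    // Kernel side of this sketch: queue an event for this LP.
    void deliver(const UserEvent& e) { input_.push(e); }

protected:
    // Kernel provides:
    bool haveMoreEvents() const { return !input_.empty(); }
    UserEvent getEvent() {
        UserEvent e = input_.front();
        input_.pop();
        return e;
    }
    UserState* getState() { return &state_; }

private:
    std::queue<UserEvent> input_;
    UserState state_;
};

// Application LP: accumulates the payload of every event it receives.
class Accumulator : public LogicalProcessBase {
public:
    void initialize() override { getState()->sum = 0; }

    void executeProcess() override {
        // Must not block: consume what is available, then return control.
        while (haveMoreEvents()) {
            UserEvent e = getEvent();
            getState()->sum += e.payload;
            getState()->eventsSeen += 1;
        }
    }

    void finalize() override { finalSum = getState()->sum; }

    long finalSum = 0;
};
```

The application never calls the scheduler; it only overrides the three entry points and uses the kernel-provided accessors inside them, which is what allows a different kernel to be substituted underneath.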

Any LP will have some state that needs to be defined. The LP modifies its 
state in response to various events that it receives. This behavior is completely 
user application specific and the application must define certain methods related 
to state for the simulation kernel to call. These methods include the creation and 
the duplication of the state. Figure 2 illustrates the user application’s interface
to the state. The method copyState is called by the kernel to copy the data from 
the UserState into a newly created state which is then archived (for rollback 
recovery purposes). This method must be overridden by the user application. 
If the application’s definition of state contains no pointers, then a bitwise copy 
is adequate for this method. If the application contains pointers in its state, or 
objects that contain pointers, then this method has to take appropriate actions 
to copy the pointers “correctly”, as defined by the needs of the application. This 
is necessary because the kernel has no knowledge about the user application’s 
state. 
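
To illustrate why a bitwise copy is not enough when the state holds pointers, here is a minimal sketch of a deep-copying copyState. StateBase is a stand-in for the kernel's state base class, and BufferState, its members, and the const-qualified parameter are assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>

// Stand-in for the kernel's state base class; the real declaration may differ.
class StateBase {
public:
    virtual ~StateBase() {}
    virtual void copyState(const StateBase* source) = 0;
};

// Hypothetical application state that owns a heap-allocated buffer.
class BufferState : public StateBase {
public:
    explicit BufferState(std::size_t n) : size_(n), data_(new int[n]()) {}
    ~BufferState() override { delete[] data_; }

    // Copies go through copyState only, so implicit copying is disabled.
    BufferState(const BufferState&) = delete;
    BufferState& operator=(const BufferState&) = delete;

    void copyState(const StateBase* source) override {
        // Assumes the kernel always passes a state of the same concrete type.
        const BufferState* src = static_cast<const BufferState*>(source);
        if (src == this) {
            return;
        }
        if (size_ != src->size_) {
            int* fresh = new int[src->size_];
            delete[] data_;
            data_ = fresh;
            size_ = src->size_;
        }
        // Copy the pointed-to data, not the pointer, so the archived state
        // is unaffected by later changes to the live state.
        std::copy(src->data_, src->data_ + size_, data_);
    }

    int& at(std::size_t i) { return data_[i]; }

private:
    std::size_t size_;
    int* data_;
};
```

A bitwise copy of BufferState would leave the archived and live states sharing one buffer, so a rollback would restore a state that had already been overwritten.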

Events represent the communication between the LPs. Figure 2 illustrates
the definition of the UserEvent class. Once again, the definition has two
parts: one set of methods is provided by the kernel and the other set is
overridden by the application writer. The method setReceiver allows the
application to set the simulation id of the receiving LP¹. setReceiveTime allows
the application to set the simulation time at which this event should be received.
The methods serialize and deserialize are provided so that the application
may maintain architectural transparency and portability among events. They are
also necessary for checkpointing in optimistic fossil collection and failure recovery.
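
As a sketch of how serialize and deserialize can give architecture-independent events, the example below writes each field in a fixed (big-endian) byte order. The TokenEvent type, the Bytes alias, and the helper functions are hypothetical; warped's SerializedInstance class is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<unsigned char> Bytes;

// Hypothetical event with two 32-bit fields.
struct TokenEvent {
    std::int32_t receiver;
    std::int32_t receiveTime;
};

// Append a 32-bit value in big-endian order, independent of host byte order.
void putInt32(Bytes& out, std::int32_t value) {
    const std::uint32_t u = static_cast<std::uint32_t>(value);
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<unsigned char>((u >> shift) & 0xFFu));
    }
}

// Read a 32-bit big-endian value starting at the given offset.
// The final cast assumes a two's-complement host, as on common platforms.
std::int32_t getInt32(const Bytes& in, std::size_t offset) {
    assert(offset + 4 <= in.size());
    std::uint32_t u = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        u = (u << 8) | in[offset + i];
    }
    return static_cast<std::int32_t>(u);
}

Bytes serialize(const TokenEvent& e) {
    Bytes out;
    putInt32(out, e.receiver);
    putInt32(out, e.receiveTime);
    return out;
}

TokenEvent deserialize(const Bytes& in) {
    TokenEvent e;
    e.receiver = getInt32(in, 0);
    e.receiveTime = getInt32(in, 4);
    return e;
}
```

Because the byte order is fixed by the code rather than by the host, the same byte stream can be read back on a machine with a different native representation, or from a checkpoint file.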

The design of the warped API was motivated by several design issues. These
issues were central to the object-oriented design of the system and needed to
be solved for constructing a simple and extensible programming interface. For
example, when the kernel needs information about data structures within the
application, they can be passed into the kernel in two ways: through template
classes or through virtual interface methods. One example of this is the state
class definition. The user state can be passed into the kernel through templates.
All that is required is that the UserLogicalProcess class be templatized on
UserState. However, to reduce overall compilation time and static executable
size, and to facilitate the use of different types of states, the templatization
approach was avoided. The convention currently followed is to have the LP and
the kernel share the responsibility of allocating, maintaining and deallocating
the state through the use of virtual methods. Although the common perception in
scientific computing is that abstraction is the enemy of performance, we have
found that the extensive use of virtual methods and other abstractions does not
drastically affect performance. When kernel data or functions need to be made
available to the user, they can be accessed by one of two mechanisms: through
the C++ inheritance mechanism (classes that the user defines must be derived
from kernel defined classes), and through “normal” function calls to methods
defined by objects in the warped kernel.

¹ As it is the user’s responsibility to register an LP with a unique simulation id, the
application can use the setReceiver method to connect LPs together.

In addition, avoiding the use of templates facilitates the distribution of source
code as stand-alone libraries which do not require recompilation. This enables the
development of “object factories” by independent vendors. With object factories,
vendors can permit different users to use various components from their object
factories without revealing the source code. To enable this type of
“plug-and-play”, C++ composition was used in preference to inheritance in the
source code. Composition also helps in achieving dynamic algorithm/method
reconfiguration (i.e., reconfiguration “on-the-fly” without recompilation).

Also, avoiding the use of templates makes the warped system simpler to 
port to different compilers on different architectures. To achieve interoperabi- 
lity on heterogeneous platforms, the serialization and deserialization operations 
play a vital role. Currently serialization and deserialization of events as well as 
states is supported. These operations are invoked only when events or states 
cross architecture boundaries. Serialization and deserialization are also applied to
checkpointing to facilitate failure recovery. 

In the current version of warped, there are several Time Warp implemen- 
tation optimizations that can be turned on/off. A configuration file is used to 
allow the user to change between the options of the simulation kernel at com- 
pile time. These options fall under several broad categories: Schedulers, Fossil 
Managers, State Managers, Memory Managers, and Time Warp optimizations 
(such as dynamic cancellation jO], dynamic checkpointing m and dynamic mes- 
sage aggregation mi)- The user specifies a selection from this set of options 
and compiles this selection. A better way to implement this is through dynamic 
configuration. Each optimization is implemented as a specific function and at 
run-time, a simulation object (or some central configuration object) can dyna- 
mically select and reconfigure (through function pointers) the optimization and 
switch between optimizations if the need arose ng. 
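
The dynamic configuration idea mentioned above can be sketched with function pointers. This is an illustration only: the policy functions, the RollbackStats fields, and the selection threshold are all hypothetical and are not taken from warped. The terms aggressive and lazy cancellation refer to the two standard Time Warp cancellation strategies.

```cpp
#include <cassert>

// Hypothetical per-object statistics that a policy might consult.
struct RollbackStats {
    int rollbacks;  // number of rollbacks observed so far
    int lazyHits;   // rollbacks after which regenerated output was unchanged
};

enum CancellationMode { AGGRESSIVE, LAZY };

// Each optimization choice is an ordinary function with a common signature.
typedef CancellationMode (*CancellationPolicy)(const RollbackStats&);

CancellationMode alwaysAggressive(const RollbackStats&) { return AGGRESSIVE; }
CancellationMode alwaysLazy(const RollbackStats&) { return LAZY; }

// Arbitrary illustrative rule: prefer lazy cancellation when more than half
// of the rollbacks would not have required anti-messages.
CancellationMode adaptive(const RollbackStats& s) {
    return (2 * s.lazyHits > s.rollbacks) ? LAZY : AGGRESSIVE;
}

class SimulationObject {
public:
    SimulationObject() : policy_(&alwaysAggressive) {}

    // Swap the policy at run time, without recompiling the kernel.
    void setCancellationPolicy(CancellationPolicy p) {
        assert(p != 0);
        policy_ = p;
    }

    CancellationMode chooseCancellation(const RollbackStats& s) const {
        return policy_(s);
    }

private:
    CancellationPolicy policy_;
};
```

A central configuration object could call setCancellationPolicy on each simulation object during the run, which is the kind of switching the paragraph above describes as preferable to a compile-time choice.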

4 Applications for warped 

Several applications have already been developed that use the warped kernel. 
These applications primarily belong to two application domains: a queuing model 
simulation library called KUE, and TyVIS, a simulation kernel for the VHDL 
hardware description language |Z|. KUE is a simple package developed for de- 
bugging, testing, and initial profiling of warped and any extensions thereof. 
TyVIS is a larger package designed to stress the simulation kernel with large 
examples of digital systems. It also demonstrates the extensibility of the warped 
kernel. The developers hope that other investigators will implement additional 
applications with warped which they can include as part of the distribution. 
As space constraints prevent us from presenting the performance of the war- 
ped system, the rest of this section is devoted to the description of warped 
applications. 
4.1 KUE: A Queuing Model Library

The KUE system is a library of queuing models built on top of the warped 
kernel. KUE is a set of C++ classes that enable the creation of parallel queuing 
applications. XKUE is a TCL/TK front end for KUE that allows “point and
click” creation of queuing models. The KUE library contains class definitions 
of seven different queuing objects (source, fork, join, delay, queue, server and 
sink objects). Each object class definition encapsulates the functionality of the 
queuing object in accordance with the warped interface. Two examples are 
distributed with the warped kernel that make use of the KUE libraries. 

The first, SMMP, is designed to simulate several processors, each with their 
own cache, and sharing a global memory. The model is generated by a pro- 
gram which lets the user adjust the following parameters: the number of proces- 
sors/caches to simulate, the number of LPs to generate, the speed of cache, the 
speed of main memory, and the cache hit ratio. The second example, RAID, is 
a simulation of a nine disk RAID level 5 array of IBM 0661 3.5” 320MB SCSI 
drives with a flat-left symmetric parity placement policy. Sixteen processes ge- 
nerate requests for data stripes of random lengths and locations. These requests 
are sent to fork processes which split them into specific disk-level requests accor- 
ding to the RAID placement policy. The nine server processes, one per simulated 
disk, process the requests in a first-come first-served fashion. After processing 
each request, the disks route their responses back to the originating processes. 
Both of these sample queuing applications possess class definitions that derive from
the seven basic queuing model definitions in the KUE library. Further details 
regarding these applications are available in the literature H2|. 



4.2 TyVIS: A Parallel VHDL Simulation Kernel 

The TyVIS VHDL simulation kernel was designed to take advantage of the 
object-oriented design of the warped kernel. It requires no modifications to the 
kernel, yet extends warped with full VHDL simulation capability (as described 
in jZj). Its implementation takes advantage of several design features of war- 
ped, and even reuses some of warped’s basic classes for TyVIS’s internal data 
structures. The main class of TyVIS is VHDLKernel, which is derived from the 
UserLogicalProcess class. 

The semantics of VHDL require that certain events generated during a simu- 
lation cycle not be applied to a signal’s value, based upon each event’s timestamp. 
This process is called marking, and is best implemented with a time-ordered 
queue. Rather than writing an entirely new data structure, the OutputQueue 
class of the warped distribution was reused, becoming a base class for the 
MarkedQueue class. The public interface to MarkedQueue is identical to that of 
OutputQueue; all additional data members and methods are private. This reuse 
of the existing code allowed the MarkedQueue class to be written and debugged 
in a matter of a few hours. Also, since MarkedQueue only accesses the public 
interface of OutputQueue, any changes in the implementation of OutputQueue 
will be transparent. 
Fig. 3. A synopsis of warped’s class derivation hierarchy



Each VHDL process has a unique state class which defines the VHDL signals 
and local variables that the process can access. This state class is built from 
warped’s UserState class with the necessary user-defined methods. This allows 
the Time Warp functions of state queuing, rollbacks, and garbage collection to 
proceed normally. The only requirement on the state class is that it
define operator=.

Processes are invoked by calling VHDLKernel::executeProcess(), which
overrides the corresponding method in the UserLogicalProcess class. This method
updates LVT and applies all events in the input queue occurring on any signals
contained in the process at the current time. The specific VHDL process code is
then executed by calling the object’s executeVHDL method, supplied by the user.
When the process returns control to the VHDL kernel, the kernel determines
which newly generated events need to be transmitted to other processes, and
transmits them using the sendEvent call from the warped kernel. Eventually,
control is returned to warped. If a process is rolled back, the VHDL kernel never
knows about it, since all related processing is contained entirely in the warped
code, lower down in the derivation hierarchy. Complete replacement of the warped
kernel with a conservatively synchronized simulation kernel would have no
effect on TyVIS; it is completely isolated from whatever processing is performed
by WARPED. A synopsis of the warped class derivation tree is illustrated in
Fig. 3. Base class definitions form the root of the derivation hierarchy.
Figure 3 also depicts the warped classes reused by the TyVIS VHDL simulation
kernel.
5 Conclusions

The WARPED Time Warp simulation project is an attempt to produce a widely
available, highly portable, and object-oriented Time Warp simulation kernel
complete with operational applications for testing and analysis. The software
is written in C++ and uses the MPI portable message passing interface. The
system operates on a distributed or shared memory multiprocessor as well as 
on a network of workstations. Several applications have been developed and are 
jointly released with the software. 

The intent of this effort is to make a testbed available for experimentation 
and analysis of Time Warp and all its affiliated optimizations. For this purpose, 
an object-oriented design approach has been followed with the aim of making 
the software easy to extend. Our experiences in the design and development of
WARPED were presented, along with a synopsis of its application programming
interface. We hope that, as investigators use and extend the capabilities of
the kernel, we will be allowed to integrate those extensions into the basic
kernel release so that others can likewise benefit from, and independently
confirm, the analysis of the extensions. Furthermore, we expect that additional
test cases for the existing (and ideally, new) applications will be
independently developed and submitted for inclusion in the kernel release,
thereby promoting reuse of source code.
Abstract. A program for simulation of the dynamics of high intensity
charged particle beams in linear particle accelerators has been developed 
in C++ using the POOMA Framework, for use on serial and parallel ar- 
chitectures. The code models the trajectories of charged particles through 
a sequence of different accelerator beamline elements such as drift cham- 
bers, quadrupole magnets, or RF cavities. An FFT-based particle-in-cell 
algorithm is used to solve the Poisson equation that models the Coulomb 
interactions of the particles. The code employs an object-oriented design 
with software abstractions for the particle beam, accelerator beamline, 
and beamline elements, using C++ templates to efficiently support both 
2D and 3D capabilities in the same code base. The POOMA Framework, 
which encapsulates much of the effort required for parallel execution, pro- 
vides particle and field classes, particle-field interaction capabilities, and 
parallel FFT algorithms. The performance of this application running 
serially and in parallel is compared to an existing HPF implementation, 
with the POOMA version seen to run four times faster than the HPF 
code. 



1 Introduction 

Particle accelerators have played a central role in shaping our present under- 
standing of the fundamental nature of matter. At the same time, the application 
of accelerator theory and technology has contributed to substantial progress in 
other branches of science and technology. This historical trend is expected to 
continue with particle accelerators playing an increasingly important role in ba- 
sic and applied science. As examples of recent applications, many countries are 
now involved in efforts aimed at developing accelerator-driven technologies for 
transmutation of radioactive waste, disposal of plutonium, energy production, 
and production of tritium. Additionally, next-generation spallation neutron sour- 
ces based on similar technology will play a major role in materials science and 
biological science research. Finally, other types of accelerators such as the
Large Hadron Collider (LHC), the Next Linear Collider (NLC), and
fourth-generation light sources will have a major impact on basic and applied
scientific research.

* This work was performed under the auspices of the U.S. Department of Energy by
Los Alamos National Laboratory under Contract No. W-7405-Eng-36.

For all of these projects, high-resolution modeling far beyond that which has
ever been performed by the accelerator community is required to reduce cost and
technological risk, and to improve accelerator efficiency, performance, and
reliability. Indeed, such modeling is essential to the success of many of these
efforts. For example, high average power linear accelerators, such as those
needed for tritium production, must operate with extremely low beam loss
(~0.1 nA/m) to prevent unacceptably high levels of radioactivity. To ensure
that this requirement will be met, it is necessary to perform very
high-resolution simulations using on the order of 100 million particles in
which the beam propagates through kilometers of complicated accelerating
structures. These simulations can only be performed on the most advanced
high-performance computing platforms using software and algorithms targeted to
parallel and distributed environments. The calculations require performance of
hundreds of GFLOPS to TFLOPS, and core memory of hundreds of GBytes.

The beam dynamics modeling effort has concentrated so far on parallel
calculations for the design of proton linear accelerators (linacs). Such
accelerators are the machines of choice for applications including radioactive
waste treatment and tritium production. Two-dimensional and fully
three-dimensional beam dynamics codes that take into account both external
accelerating and focusing fields, as well as the inter-particle Coulomb forces
in the beam, are in an advanced stage of development and have already been used
for accelerator design studies. This paper describes the design and
implementation of a parallel application used to model high-intensity charged
particle beams moving through a linear accelerator, using an object-oriented
design in C++ based on the POOMA Framework. The performance of this code is
compared to an HPF implementation of the application, running serially and in
parallel on the SGI Origin2000 parallel computers available at Los Alamos
National Laboratory.



2 Simulating Linear Accelerators 

To simulate the motion of charged particles through a linear accelerator, we
have employed an object-oriented (OO) software design in our application. Using
an OO design strategy makes it easier to develop modular, maintainable code
which can easily be extended to incorporate new algorithms, simulation
components, and capabilities. The characteristics of linear accelerators,
consisting of sequences of beamline elements through which particles move as
they are accelerated, lend themselves quite well to being modeled using an OO
design. We can consider this system as being composed of the following
abstractions.

Beamline Elements consist of the distinct portions of the linear accelerator
beamline through which the particles move. Particles interact with the elements
in various ways as they propagate through them; for example, quadrupole magnet
elements focus the beam as the charged particles move through their magnetic
fields.

The Beamline comprises the collection of different beamline elements which make
up the linear accelerator, in the order the elements are encountered by the
particles.

The Beam is the set of charged particles being accelerated by the system. Par- 
ticles have characteristics such as phase-space coordinates, charge, and mass, 
and move through the beamline subject to the equations of motion for a linear 
accelerator. 

The Accelerator is the entire system, comprising the beamline and the beam. 

As the particles in the beam move through the beamline, passing through
each beamline element, they experience both external forces due to the element
they are passing through and internal forces due to the space-charge interaction
of the particles with each other. The space-charge forces are calculated using a
standard FFT-based particle-in-cell (PIC) algorithm for a collisionless system.
In this algorithm, we first solve the Poisson equation

$$\nabla^2 \phi(\mathbf{r}) = -4\pi \rho(\mathbf{r}) \qquad (1)$$

to find the electrostatic potential $\phi(\mathbf{r})$ from the charge density
field $\rho(\mathbf{r})$ of the particles. From $\phi(\mathbf{r})$, the
space-charge force $\mathbf{F}_i(\mathbf{r})$ on each particle with charge
$q_i$ is computed using

$$\mathbf{E}(\mathbf{r}) = -\nabla \phi(\mathbf{r}) \qquad (2)$$

$$\mathbf{F}_i(\mathbf{r}) = q_i \mathbf{E}(\mathbf{r}). \qquad (3)$$

The standard PIC algorithm, used in the codes discussed here, may be summarized
as:

1. Scatter charge onto a grid to obtain a discretized charge density
$\rho(\mathbf{r})$;

2. Solve (1) to determine the electrostatic potential $\phi(\mathbf{r})$ on a
grid;

3. Compute the electric field vectors $\mathbf{E}(\mathbf{r})$ from
$\phi(\mathbf{r})$ on a grid by finite difference methods;

4. Gather the electric field vectors from the grid to the particle positions,
and calculate the force on each particle $\mathbf{F}_i(\mathbf{r})$ using (3).

The beamline element forces and the space-charge interaction forces result in
changes to the momentum and position of the particles, causing them to
accelerate through the beamline.
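The four PIC steps above can be illustrated with a minimal one-dimensional
sketch. It is not the POOMA implementation: the function names, the periodic
domain, the naive O(n^2) transform standing in for an FFT, and the choice of
cloud-in-cell weights are all assumptions made for illustration. The mean
charge density (the k = 0 mode) is discarded, which is the usual convention
for a periodic domain.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform, standing in for a real FFT."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        total = 0j
        for j, v in enumerate(values):
            total += v * cmath.exp(sign * 2j * math.pi * j * k / n)
        out.append(total / n if inverse else total)
    return out


def cic_weights(x, dx, n):
    """Cloud-in-cell: left and right grid indices and their linear weights."""
    s = x / dx
    base = math.floor(s)
    frac = s - base
    left = int(base) % n
    return left, (left + 1) % n, 1.0 - frac, frac


def space_charge_forces(positions, charges, n, length):
    """One PIC force evaluation on a periodic 1D grid (Gaussian units)."""
    dx = length / n

    # Step 1: scatter charge onto the grid to obtain the charge density rho.
    rho = [0.0] * n
    for x, q in zip(positions, charges):
        i0, i1, w0, w1 = cic_weights(x, dx, n)
        rho[i0] += q * w0 / dx
        rho[i1] += q * w1 / dx

    # Step 2: solve (1) in Fourier space, where -k^2 phi_k = -4 pi rho_k.
    rho_k = dft(rho)
    phi_k = [0j] * n  # the k = 0 (mean) mode is left at zero
    for m in range(1, n):
        wave = m if m <= n // 2 else m - n
        k = 2.0 * math.pi * wave / length
        phi_k[m] = 4.0 * math.pi * rho_k[m] / (k * k)
    phi = [z.real for z in dft(phi_k, inverse=True)]

    # Step 3: E = -d(phi)/dx by central finite differences, as in (2).
    efield = [-(phi[(i + 1) % n] - phi[(i - 1) % n]) / (2.0 * dx)
              for i in range(n)]

    # Step 4: gather E to the particle positions and apply (3), F = q E.
    forces = []
    for x, q in zip(positions, charges):
        i0, i1, w0, w1 = cic_weights(x, dx, n)
        forces.append(q * (w0 * efield[i0] + w1 * efield[i1]))
    return forces


forces = space_charge_forces([0.4, 0.6], [1.0, 1.0], n=32, length=1.0)
```

With two like charges the sketch pushes them apart, and because the same
weights are used for scatter and gather the two forces cancel to rounding
error, so the scheme conserves total momentum.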

3 Implementation Using the POOMA Framework 

Figure 1 presents an overview of the object-oriented design of the particle
accelerator simulation code, illustrating the abstractions for the accelerator,
beam, and beamline components. Each solid box represents an object; the top
half of each box indicates the object name, while the bottom half indicates the
important methods or variables for the object. Lines terminating in arrows
indicate inheritance (“is a”) relationships; lines originating from diamonds
indicate “has a” relationships.

The simulation code is implemented in ANSI/ISO C++ using the POOMA
Framework, and making use of the template facilities of C++. The objects
shown in Fig. 1 correspond to C++ classes used in the application. These classes
are templated on the number of dimensions and the floating-point type, making
it possible to use the same source code base for simulations of different
dimensions or data type precision. For the small fraction of the code which
cannot be generalized to a dimension-independent formulation, specializations
of the relevant functions are provided. At present, this specialization has
been done for two and three dimensions.




Fig. 1. A summary of the object design for the linear accelerator simulation code. An 
Accelerator consists of a Beam (a collection of charged particles) and a Beamline (a 
set of N BeamlineElems) 



The Accelerator class contains the primary components of the simulation,
namely a Beam instance and a Beamline instance. When created, Accelerator
objects determine simulation parameters and beamline components from an
input file, and initialize their Beam and Beamline accordingly. The run() method
carries out the steps of the computation by calling the integrate() method of
the Beamline. The Beamline in turn propagates the particles through each
individual BeamlineElem; these are polymorphic classes that compute the
specialized forces used to update the momentum and position of the Beam
particles. The BeamlineElem computations invoke the spaceCharge() method of the
Beam to calculate the space-charge interaction forces for the particles.
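The call chain described above (run() calls integrate(), which propagates the
beam through each polymorphic element, which in turn asks the beam for its
space-charge forces) can be sketched as follows. This is a toy illustration,
not the application's C++ classes: the one-dimensional state, the element
types, the time-stepping scheme, and the stubbed space_charge() method are all
assumptions introduced here.

```python
class Beam:
    """Toy beam: one transverse position and momentum per particle."""

    def __init__(self, x, px):
        self.x = list(x)
        self.px = list(px)

    def space_charge(self):
        # Stub: a real code would run the PIC solve here.
        return [0.0] * len(self.x)


class BeamlineElem:
    """Base class; subclasses supply the external force of one element."""

    def __init__(self, length):
        self.length = length

    def external_force(self, beam):
        raise NotImplementedError

    def propagate(self, beam, steps=10):
        # Hypothetical integrator: unit mass, element length used as time.
        dt = self.length / steps
        for _ in range(steps):
            f_ext = self.external_force(beam)
            f_sc = beam.space_charge()
            for i in range(len(beam.x)):
                beam.px[i] += (f_ext[i] + f_sc[i]) * dt
                beam.x[i] += beam.px[i] * dt


class Drift(BeamlineElem):
    """Field-free element: particles coast."""

    def external_force(self, beam):
        return [0.0] * len(beam.x)


class Quadrupole(BeamlineElem):
    """Linear focusing force proportional to the offset from the axis."""

    def __init__(self, length, strength):
        super().__init__(length)
        self.strength = strength

    def external_force(self, beam):
        return [-self.strength * x for x in beam.x]


class Beamline:
    """Ordered collection of elements, visited in sequence."""

    def __init__(self, elems):
        self.elems = list(elems)

    def integrate(self, beam):
        for elem in self.elems:
            elem.propagate(beam)


class Accelerator:
    """The whole system: a beam plus the beamline it travels through."""

    def __init__(self, beam, beamline):
        self.beam = beam
        self.beamline = beamline

    def run(self):
        self.beamline.integrate(self.beam)
```

Adding a new element type in this layout only requires a new subclass with its
own external_force(); the Beamline and Accelerator code is unchanged, which is
the extensibility argument made in the text.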

The accelerator simulation code is built upon the POOMA Framework, a
templated C++ class library which provides C++ abstractions for physical
quantities such as particles and fields. POOMA provides N-dimensional parallel
data structures for the beam particles and for the space-charge field quantities
such as the charge density $\rho(\mathbf{r})$, electrostatic potential
$\phi(\mathbf{r})$, and electric field $\mathbf{E}(\mathbf{r})$. POOMA
encapsulates the complexity of providing a parallel run-time system,
maintaining parallel data structures, and efficiently performing data-parallel
computations. C++ template techniques such as expression templates are used to
implement a data-parallel syntax for expressions involving field and particle
quantities; such expressions are evaluated at the same speed as hand-coded
evaluation loops. POOMA allows the user to write scientific simulation codes
that can be run serially or in parallel with no change to the source code. The
Beam class in Fig. 1 uses POOMA ParticleAttrib objects for the particle position
and momentum data, and POOMA Field objects for $\rho(\mathbf{r})$,
$\phi(\mathbf{r})$, etc.

The solution of the Poisson equation (1) is computed with an FFT-based
algorithm that uses multi-dimensional FFT routines from the POOMA
Framework. POOMA also provides a number of particle-field interaction
capabilities such as gather/scatter algorithms with different interpolation
schemes. At present, both cloud-in-cell and nearest-grid-point interpolation
mechanisms are supported; additional algorithms are straightforward to
implement and use with the POOMA gather/scatter routines.




Fig. 2. Visualization of a sample 2D accelerator simulation. (a) Particle
positions colored by kinetic energy. (b) Charge density field
$\rho(\mathbf{r})$

The POOMA Framework also provides a run-time visualization option that
can be used to display particle and field data structures either at run time
or post-processed from data files. Figure 2 shows a sample visualization from
a 2D linear accelerator simulation, using the POOMA run-time visualization
facilities. Figure 2a displays the positions of particles within the accelerator
colored by their kinetic energy, and Fig. 2b displays the charge-density field
$\rho(\mathbf{r})$ that results from scattering the electric charge of the
particles onto a grid.

The POOMA Framework has proven to be an important tool in the implementation
of the linear accelerator simulation code. The strong support in C++ for
object-oriented programming features such as polymorphism, inheritance, and
data abstraction, coupled with the template facilities of C++, makes it
a useful language with which to implement a scientific application such as this.
Also, templates provide a mechanism to avoid unnecessary run-time costs
normally associated with the use of languages that support OO design, while
still maintaining a high degree of flexibility and extensibility in a program.

Software development frameworks such as POOMA have proven to be a
powerful tool for high-performance parallel scientific applications. The POOMA
Framework has been used for several other codes in fields such as neutron
transport, and as a basis for other frameworks such as Tecolote. Several other
libraries such as PETSc [11], which includes several linear and nonlinear
system solvers, and Overture [12], which provides explicit support for
overlapping grids in complex geometries, are used as a basis for parallel
simulation codes in a wide range of applications. The advantage of using these
different systems is clear: building a simulation code on top of an existing
parallel application framework simplifies application design, shortens
development time, and improves portability to different parallel platforms and
communication mechanisms.

4 Performance 

Table 1 and Fig. 3 compare the performance of the POOMA-based linear
accelerator simulation code with a similar application written in
High-Performance Fortran. This comparison was carried out on the Silicon
Graphics Origin2000 parallel supercomputers at Los Alamos National Laboratory,
using the SGI C++ compiler (version 7.2) and the Portland Group HPF compiler
(version 2.2). The calculations were all 2D simulations with a beamline
comprising ten beamline elements.

Table 1 shows running times for a 2D fixed-size problem on different numbers
of processors. The problem modeled $10^6$ particles moving through 10 beamline
elements, using a $256^2$ grid for the space-charge computation. The codes used
were an HPF program and two POOMA-based versions that differed in their use
of FFT routines. The POOMA code labeled “C-C” in the table used a
complex-to-complex FFT algorithm, and the POOMA code labeled “R-C” used
real-to-complex FFT routines. All three codes produced equivalent diagnostic
results. The table gives the total simulation time (averaged across the
processors) and the amount of time spent in the gather/scatter and FFT portions
of the space-charge computation, which is the single largest part of the
simulation time.



Table 1. Run times (seconds) for a fixed problem size ($10^6$ particles, $256^2$ grid)

        |          Total          |     Gather/Scatter      |         FFT
 Nodes  |   R-C     C-C     HPF   |   R-C     C-C     HPF   |   R-C    C-C    HPF
 -------+-------------------------+-------------------------+---------------------
    1   |  537.2   608.6  1998.8  |  385.6   392.5  1500.2  |  31.5   83.5  120.1
    2   |  312.0   340.2  1300.8  |  197.6   198.3  1037.3  |  23.5   44.7   70.8
    4   |  171.2   184.2   873.2  |   99.1    99.4   714.9  |  13.6   24.2   51.3
    8   |   96.9   104.0   467.2  |   49.3    49.7   384.3  |   7.4   13.0   24.5
   16   |   61.7    65.0   195.8  |   24.6    24.7   157.5  |   4.7    7.4   11.2
   32   |   44.6    46.8   157.1  |   12.2    12.2   120.8  |   3.9    5.4   14.4


[Figure 3: (a) "Fixed Problem Size: Parallel Speedup", speedup versus number of
processors (0 to 32); (b) "Fixed Problem Size: HPF vs POOMA", 256 x 256 grid,
1000000 particles, with bars for POOMA C-C and POOMA R-C.]

Fig. 3. Performance comparison between POOMA and HPF implementations of the
accelerator simulation code, in two dimensions. (a) Parallel speedup
(single-processor simulation time divided by multi-processor simulation times)
for POOMA and HPF codes for a simulation of $10^6$ particles on a $256^2$ grid.
(b) Relative speedup of POOMA over HPF codes for the same problem (HPF
simulation time divided by POOMA simulation time)

From the first three columns of Table 1, which list the total simulation time,
we see that the POOMA codes outperformed the HPF code by a factor between
3 and 5. Figure 3a, which shows the parallel speedup of the three codes, and
Fig. 3b, which shows the speedup of the two POOMA codes relative to the HPF
code, both demonstrate that the improvement is consistent from one to 32 nodes.
The largest improvement between the timings for the HPF code and the POOMA
codes is in the time to perform the gather and scatter operations, shown in the
middle three columns of Table 1. The times to perform the FFT operations for
the POOMA codes were also shorter than for the HPF code, particularly for
the real-to-complex version of the POOMA code, but for this problem size the
gather/scatter time represents the majority of the computation.
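The "factor between 3 and 5" can be checked directly from the total run times
reported in Table 1. The short script below only restates that arithmetic; the
variable names are ours, and the numbers are copied from the table.

```python
nodes = [1, 2, 4, 8, 16, 32]

# Total run times in seconds, copied from Table 1.
total_rc = [537.2, 312.0, 171.2, 96.9, 61.7, 44.6]
total_cc = [608.6, 340.2, 184.2, 104.0, 65.0, 46.8]
total_hpf = [1998.8, 1300.8, 873.2, 467.2, 195.8, 157.1]

# Relative speedup of each POOMA version over HPF (HPF time / POOMA time).
hpf_over_rc = [h / p for h, p in zip(total_hpf, total_rc)]
hpf_over_cc = [h / p for h, p in zip(total_hpf, total_cc)]

# Parallel speedup of each code (single-node time / multi-node time).
speedup_rc = [total_rc[0] / t for t in total_rc]
speedup_hpf = [total_hpf[0] / t for t in total_hpf]

for n, a, b in zip(nodes, hpf_over_rc, hpf_over_cc):
    print(f"{n:2d} nodes: HPF/R-C = {a:.2f}, HPF/C-C = {b:.2f}")
```

The ratios range from about 3.0 (C-C on 16 nodes) to about 5.1 (R-C on 4
nodes), in line with the statement in the text.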

The performance gain with a real-to-complex FFT, which requires less
storage and fewer elements in the FFT calculation, is particularly noticeable
for problems with a small number of particles per cell. A set of simulations of
$10^5$ particles on a $256^2$ grid using increasing numbers of nodes is
summarized in Table 2 using the C-C and R-C versions of the POOMA application.
Here, the relative improvement in performance using the real-to-complex version
is much more noticeable than in the $10^6$-particle simulation. While the
parallel speedup for the C-C version is greater than that of the R-C version,
the R-C version has much better single-node performance and reaches the point
of diminishing parallel returns earlier than the C-C code.
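The storage and work savings of a real-to-complex transform come from the
conjugate symmetry of the spectrum of real data: for a real sequence of length
n, coefficient n - k is the complex conjugate of coefficient k, so only
n/2 + 1 coefficients need to be computed and stored. The sketch below
demonstrates this property with a naive transform; it is an illustration of
the principle, not the POOMA FFT routines, and all function names are
hypothetical.

```python
import cmath
import math


def coefficient(x, k):
    """One coefficient of the discrete Fourier transform of x."""
    n = len(x)
    return sum(x[j] * cmath.exp(-2j * math.pi * j * k / n) for j in range(n))


def complex_to_complex(x):
    """Full spectrum: n coefficients."""
    return [coefficient(x, k) for k in range(len(x))]


def real_to_complex(x):
    """Half spectrum of a real sequence: only n // 2 + 1 coefficients."""
    return [coefficient(x, k) for k in range(len(x) // 2 + 1)]


def expand(half, n):
    """Rebuild the full spectrum using X[n - k] = conj(X[k])."""
    full = list(half)
    for k in range(n // 2 + 1, n):
        full.append(half[n - k].conjugate())
    return full


signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, 2.5, 1.5]
full = complex_to_complex(signal)
half = real_to_complex(signal)
```

Here the half spectrum has 5 entries instead of 8, and expand(half, 8)
reproduces the full spectrum to rounding error.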

Table 1 and Table 2 demonstrate that the gather and scatter portions of 
the POOMA codes scale reasonably well with the number of processors. The 
POOMA version performs an initial particle load-balancing that equally parti- 
tions the particles among processors and contributes to the nearly linear scaling 
behavior of the gather/scatter routines. This highlights a major difference bet- 
ween the POOMA and HPF simulation codes: the parallelization strategy for the 
particle data. POOMA employs a spatial decomposition strategy, which keeps 
particles local to the processor containing their charge density field and elec- 
tric field data by reassigning particles to processors when the particle positions 
are changed. With a spatial decomposition, gather/scatter operations between 
the particles and fields require a minimum of communication. The HPF code 
employs a static partitioning of particles across the processors, requiring extra 
communication for the gather/scatter phase. In both cases, a roughly equal por- 
tion of the particles is kept on each processor. The extra time spent by POOMA 
to maintain particle locality and to perform the initial load balancing is more 
than made up for by reduction in the times for gather/scatter operations. 
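The difference between the two particle parallelization strategies can be made
concrete with a small counting experiment. The sketch below is a simplified
model rather than either code's actual layout: it assumes a 1D grid split into
contiguous slabs, uniformly random particle cells, and it ignores the
neighbouring-cell accesses that interpolation makes at slab boundaries. All
names and sizes are illustrative.

```python
import random


def owner_of_cell(cell, n_cells, n_procs):
    """Processor that owns a grid cell under a contiguous slab decomposition."""
    return cell * n_procs // n_cells


def remote_accesses(cells, particle_owner, n_cells, n_procs):
    """Count particles whose grid cell lives on a different processor."""
    return sum(1 for c, p in zip(cells, particle_owner)
               if owner_of_cell(c, n_cells, n_procs) != p)


rng = random.Random(0)
n_cells, n_procs, n_particles = 256, 8, 10000
cells = [rng.randrange(n_cells) for _ in range(n_particles)]

# Static partitioning: particle i stays on a fixed processor chosen by index.
static_owner = [i * n_procs // n_particles for i in range(n_particles)]

# Spatial decomposition: each particle lives with the grid cell it occupies.
spatial_owner = [owner_of_cell(c, n_cells, n_procs) for c in cells]

static_remote = remote_accesses(cells, static_owner, n_cells, n_procs)
spatial_remote = remote_accesses(cells, spatial_owner, n_cells, n_procs)
```

In this model the spatial decomposition needs no remote grid accesses, while
under static partitioning most particles touch grid data owned by another
processor; the price of the spatial scheme is that particles must be
reassigned whenever they cross a slab boundary.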

Table 2. Run times (seconds) for a fixed problem size ($10^5$ particles, $256^2$ grid)

        |     Total      | Gather/Scatter |      FFT      |    Speedup
 Nodes  |  R-C     C-C   |  R-C     C-C   |  R-C    C-C   |   R-C     C-C
 -------+----------------+----------------+---------------+----------------
    1   |  93.4   158.6  |  38.8    39.4  |  31.2   83.4  |    -       -
    2   |  58.9    84.6  |  20.0    19.9  |  23.4   25.3  |   1.59    1.87
    4   |  32.5    45.9  |   9.7     9.9  |  13.3   24.3  |   2.87    3.46
    8   |  17.8    24.9  |   4.7     4.8  |   7.1   12.8  |   5.25    6.40
   16   |  10.9    14.3  |   2.3     2.4  |   4.1    6.8  |   8.57   11.09
   32   |   8.9    10.2  |   1.3     1.3  |   3.5    4.7  |  10.49   15.55

For large problem sizes, the majority of the computation time is spent in
particle gather/scatter operations. In addition to the use of a spatial
decomposition strategy to minimize the communication during gather and scatter
calculations, POOMA provides an option to cache the particle-field
interpolation information generated in one gather or scatter operation for
later gather/scatter calls. Interpolation between particle and field positions
involves determination of nearest grid positions and interpolation weights,
which do not change from one gather/scatter call to the next unless the
particle positions change. For these linac simulation codes, the particles do
not move between the time when charge is scattered onto the charge-density
field and when the electric field vectors are gathered back to determine the
electrostatic force. By caching the interpolation information from the scatter
and reusing it during the gather, the gather operations in the 2D POOMA codes
are seen to run up to three times faster than the corresponding scatter
operation.
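The caching idea can be sketched in a few lines. The example below is a
one-dimensional illustration with cloud-in-cell weights, not the POOMA
interface: the InterpolationCache class and the function names are
hypothetical. The scatter computes and stores the grid indices and weights for
each particle, and the gather reuses them without touching the particle
positions again.

```python
import math


class InterpolationCache:
    """Holds per-particle grid indices and weights from the last scatter."""

    def __init__(self):
        self.entries = None


def cic(x, dx, n):
    """Cloud-in-cell indices and weights for a particle at position x."""
    s = x / dx
    base = math.floor(s)
    frac = s - base
    left = int(base) % n
    return left, (left + 1) % n, 1.0 - frac, frac


def scatter(positions, charges, n, dx, cache):
    """Deposit charge on the grid and remember the interpolation data."""
    grid = [0.0] * n
    cache.entries = []
    for x, q in zip(positions, charges):
        entry = cic(x, dx, n)
        cache.entries.append(entry)
        i0, i1, w0, w1 = entry
        grid[i0] += q * w0
        grid[i1] += q * w1
    return grid


def gather(field, cache):
    """Interpolate a grid field to the particles using the cached data."""
    return [w0 * field[i0] + w1 * field[i1]
            for i0, i1, w0, w1 in cache.entries]


cache = InterpolationCache()
density = scatter([0.25, 1.5], [2.0, 1.0], n=4, dx=1.0, cache=cache)
gathered = gather([0.0, 1.0, 2.0, 3.0], cache)
```

The cache is only valid while the particles stay put, which is exactly the
situation between the scatter and the gather of a single space-charge
evaluation; it must be rebuilt after every position update.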

Table 3 compares the execution times for the POOMA and HPF versions of
the linac simulation code on two different parallel architectures. In addition
to the Origin2000 machines at Los Alamos National Laboratory, the codes were run
on the Cray T3E at the National Energy Research Scientific Computing Center.
On the T3E, the POOMA code was compiled with the Kuck and Associates KCC
3.2b2 compiler (version 3.2d), and the HPF code was compiled with the Portland
Group HPF compiler (version 2.4). The results in Table 3 are for a 2D simulation
of 500,000 particles on a $256^2$ grid, and the real-to-complex FFT version of
the POOMA code was used. On the T3E, the POOMA version executes from just
about the same speed (32 nodes) to roughly 40 percent faster (2 nodes) than the
HPF code. This improvement is not as dramatic as what is observed on the
Origin2000 machines, but is consistent with the previous results in that the
difference in times is due primarily to faster gather/scatter operations in the
POOMA implementation.



Table 3. Run times (seconds) for different architectures (500,000 particles, $256^2$ grid)

        |  SGI Origin2000  |    Cray T3E
 Nodes  |   R-C      HPF   |   R-C     HPF
 -------+------------------+----------------
    1   |  291.0   1064.7  |  473.2   586.6
    2   |  170.1    708.0  |  263.5   370.6
    4   |  113.5    397.5  |  143.0   198.9
    8   |   65.9    247.2  |   80.3   110.8
   16   |   38.7    107.3  |   50.9    63.6
   32   |   31.8    107.6  |   36.7    35.8



5 Conclusions 

Using the POOMA Framework, a C++ application which models the motion
of high-intensity charged particle beams through a linear accelerator has been
developed that runs substantially faster than an equivalent HPF application on
a number of different platforms. This performance increase can be attributed in
part to the use of a spatial decomposition strategy for the parallel computation
in the POOMA version of the code that reduces the parallel communication
required during parallel gather and scatter operations, and in part to the use
of a real-to-complex FFT algorithm in the POOMA version. The linac simulation
code employs an object-oriented design strategy; by using the POOMA Framework
as a basis for the development, the design is able to focus on the specific
physics abstractions of the accelerator in a modular, extensible manner. POOMA
automatically provides the parallel data structures and algorithms, efficient
evaluation of data-parallel expressions, and abstractions of the
hardware-specific parallel communication issues for the accelerator code.

Abstract. We describe the design and implementation of high-performance
numerical software in Java. Our primary goals are to characterize
the performance of object-oriented numerical software written in Java
and to investigate whether Java is a suitable language for such endeavors.
We have implemented JLAPACK, a subset of the LAPACK library in Java.
LAPACK is a high-performance Fortran 77 library used to
solve common linear algebra problems. JLAPACK is an object-oriented
library using encapsulation, inheritance, and exception handling. It
performs within a factor of four of the optimized Fortran version for certain
platforms and test cases. When used with the native BLAS library,
JLAPACK performs comparably with the Fortran version using the native
BLAS library. We conclude that high-performance numerical software
could be written in Java if a few concerns about language features and
compilation strategies are addressed.



1 Introduction 

Java has achieved rapid success due to several key features. Java bytecodes are 
portable, so programs can be run on any machine that has an implementation 
of the Java Virtual Machine (JVM). Java provides garbage collection, freeing 
programmers from concerns about memory management and leaks. The language 
contains no pointers and dynamically checks array accesses, which help avoid 
common bugs in C programs. Java is establishing itself as a language of choice 
for many software developers. 

Java is attractive to the scientific computing community for the same
reasons. However, several factors limit Java’s inroads. First, Java performance has 
been a source of concern. Many of the attractive features of Java caused early in- 
terpreted versions of the JVM to perform poorly when compared with compiled 
languages like Fortran and C. Second, the absence of a primitive complex type 
presents another obstacle, as many numeric codes make extensive use of complex 
numbers. Finally, several language features that make numeric codes less cum- 
bersome to write, such as operator overloading and parametric polymorphism, 
are absent in Java. 

However, we believe that Java may be suitable for writing high-performance
numerical software. The problems discussed above can be partially circumvented
by careful programming techniques. Furthermore, certain language features, such
as primitive complex types, may be included in future versions of Java. To test
our hypothesis that good performance can be achieved in Java, we designed
and implemented JLAPACK, a proof-of-concept version of LAPACK in Java.
LAPACK is a high-performance Fortran 77 library that solves common linear
algebra problems. This library is well-suited for our study for several reasons: it
is a standard library in the scientific community; it is used to solve common and
useful problems; and it is highly optimized, giving us a hard performance bound.

Our implementation of JLAPACK follows the Fortran version closely in spirit 
and structure. However, we did not write Fortran-style code in Java. JLAPACK 
employs object-oriented techniques such as inheritance, dynamic dispatch, and 
exception handling. We use classes to represent vectors, matrices, and other 
objects. We use exceptions for error handling. For performance analysis, we 
ran our code using a fully compliant JVM, with bounds checking and garbage 
collection enabled. JLAPACK performs within a factor of four of the optimized 
Fortran version for certain platforms and test cases. 

2 LAPACK 

LAPACK is a library of Fortran 77 routines for common linear algebra
problems, such as systems of linear equations, linear least squares problems,
eigenvalue problems, and singular value problems. LAPACK uses block-oriented
algorithms for many operations, providing more locality of reference and
allowing the use of matrix-matrix operations. The library handles both real and
complex numbers, with versions for both single and double precision
representations. There are specialized routines for structured matrices, such
as banded matrices, tridiagonal matrices, and symmetric positive-definite
matrices. JLAPACK currently implements only the simple linear equation solver
for general matrices (i.e., xGESV and the routines it requires) with both
blocking and nonblocking versions.

LAPACK uses the Basic Linear Algebra Subroutines (BLAS) for many of its
time-critical inner loops. Most high-performance machines
have BLAS libraries with machine-specific optimizations, called native BLAS.
Generic Fortran 77 BLAS code is available and is distributed with LAPACK.
For JLAPACK, we provided two versions: one implemented in Java, and the
other employing vendor-supplied native BLAS. The latter version provides Java
wrappers around the Fortran BLAS routines, using the native method call
mechanism of Java. Bik and Gannon have shown that native methods can be
used to achieve good performance, and our findings support their results.

3 JLAPACK 

JLAPACK and JBLAS are our Java implementations of the LAPACK and BLAS
libraries, currently implementing the subset of the subroutines in both libraries
that are used by the simple general equation solver. We follow the Fortran version
in spirit and in structure, with every Fortran subroutine corresponding to a Java
method. We retain the Fortran naming conventions, providing implementations
for four data types: single precision real (S), double precision real (D), single
precision complex (C), and double precision complex (Z).

Several goals influenced the design of JLAPACK. First, we wanted to en- 
capsulate all the information specifying a vector or matrix into a class. This 
information fits into two categories that should be kept orthogonal: the data 
and its shape. Second, we wanted to store matrix data in a one-dimensional 
array for two reasons: first, two-dimensional arrays in Java are not guaranteed 
to be contiguous in memory, so a one-dimensional array provides more locality 
of reference; second, accessing an element in a two-dimensional array requires 
bounds checks on both indices, doubling bounds checking overhead. Third, we 
wanted to allow matrices and vectors to share data. A vector object that repre- 
sents a column of a matrix should be able to use the same data as the matrix 
itself. Our final goal was to limit the number of constructor calls, as this is a 
known source of overhead in naive object-oriented programs. 

Our design contains three separate components: the JLASTRUCT package, the
JBLAS package, and the JLAPACK package. The JLASTRUCT package supplies the
vector, matrix, and shape classes used by the library. The JBLAS and JLAPACK
packages contain the BLAS library code and LAPACK library code, respectively.
Both contain four classes, one for each data type. Because these classes have
no instance members, all of their methods are static. Each method in the JBLAS
classes corresponds to a subroutine in the BLAS library and each method in the
JLAPACK classes corresponds to a subroutine in the LAPACK library. We now
discuss in detail the design of these packages.

3.1 The Vector, Matrix, and Shape Classes 

In Fortran 77, information about the shapes of vectors and matrices must be 
represented as scalar variables and passed as extra arguments to every routine 
manipulating vectors and matrices. The vector and matrix classes in our design 
encapsulate this information into the abstraction of shape. There are vector and 
matrix classes for each of the four data types. 

The class JLASTRUCT.Vector implements two methods:

eltAt(i) returns the ith element in the vector
assignAt(val, i) stores val in the vector’s ith element

The class JLASTRUCT.Matrix implements the following methods for matrices:

eltAt(i, j) returns the element at location (i, j)
assignAt(val, i, j) stores val at location (i, j)
colAt(i, v) aliases the vector v to the ith column
rowAt(i, v) aliases the vector v to the ith row
submatrix(i, j, r, c, m) aliases the matrix m to the submatrix of size (r, c)
starting at location (i, j)

These classes contain two members: data and shape. The data member is a
one-dimensional array of the appropriate type that is guaranteed to contain all
the vector/matrix elements. The shape member is of type JLASTRUCT.VShape
(for vectors) or JLASTRUCT.MShape (for matrices), both of which are subclasses
of the abstract class JLASTRUCT.Shape. The shape object defines the layout of
the vector or matrix elements in the data array.

An object of type JLASTRUCT.VShape contains the members:

start: The index in data of the first vector element
len: The number of vector elements
inc: The step size in data between consecutive vector elements

Therefore, element i of a vector resides in slot j of its data array, where
j = start + i * inc. Elements of a vector are evenly spaced in the data array.

An object of type JLASTRUCT.MShape contains the members:

start: The index in data of the first matrix element
rows: The number of rows in the matrix
cols: The number of columns in the matrix
ld: The distance in data between the first elements in consecutive columns

Therefore, matrix element (i, j) resides in location k of its data array, where
k = start + ld * j + i. Note the column-major storage order and the zero-based
indexing of arrays. This fits the Fortran model, allowing JLAPACK to use the
same optimizations as the Fortran version and enabling native BLAS to be
incorporated.
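The two index formulas can be exercised on a small example. The sketch below
is written in Python rather than Java for brevity; the function names are
ours, while start, inc, and ld mirror the shape members described above. It
shows how a column and a row of the same column-major array are both described
by a (start, inc) pair.

```python
def vector_slot(start, inc, i):
    """Slot in the data array of element i of a vector: start + i * inc."""
    return start + i * inc


def matrix_slot(start, ld, i, j):
    """Slot in the data array of matrix element (i, j): start + ld * j + i."""
    return start + ld * j + i


# A 3 x 2 matrix stored column by column: columns are (1, 2, 3) and (4, 5, 6).
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rows, cols, ld = 3, 2, 3

# Element (1, 1) lives at slot 0 + 3 * 1 + 1 = 4.
element_1_1 = data[matrix_slot(0, ld, 1, 1)]

# Column 1 is a vector with start = slot of (0, 1) and inc = 1.
column_1 = [data[vector_slot(matrix_slot(0, ld, 0, 1), 1, i)]
            for i in range(rows)]

# Row 1 is a vector with start = slot of (1, 0) and inc = ld.
row_1 = [data[vector_slot(matrix_slot(0, ld, 1, 0), ld, j)]
         for j in range(cols)]
```

A row therefore differs from a column only in its start and inc values, which
is why a single vector routine can serve both cases.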

This implementation allows objects to share data arrays. Figure 1 shows how
this may occur. The ability to share member objects improves the performance
of methods used to obtain rows, columns, and sub-matrices of matrices. We will
use the colAt() method as an example, as the other two are implemented in the
same way. A naive implementation of this method would allocate new memory for
the vector and new memory for its shape. Instead, colAt() takes as a parameter
a vector that has already been allocated. The method then only sets the
vector’s data member (by giving it a reference to the matrix’s own data array)
and updates its shape object. This approach eliminates unnecessary data copying
and allows reuse of storage for temporary vectors and matrices.
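
A minimal sketch of this aliasing idea follows. It is an illustration under stated assumptions, not the JLASTRUCT source: the class names VecView and MatView are hypothetical, and the shape fields are inlined into the view classes rather than held in separate shape objects.

```java
// Minimal sketch of data sharing; the class names are hypothetical.
final class VecView {
    double[] data;        // reference to shared storage, never copied here
    int start, len, inc;  // shape fields, inlined for brevity

    double eltAt(int i) { return data[start + i * inc]; }
    void assignAt(double val, int i) { data[start + i * inc] = val; }
}

final class MatView {
    final double[] data;
    final int start, rows, cols, ld;

    MatView(double[] data, int start, int rows, int cols, int ld) {
        this.data = data;
        this.start = start;
        this.rows = rows;
        this.cols = cols;
        this.ld = ld;
    }

    // Alias the preallocated vector v to column j: no allocation, no copy.
    void colAt(int j, VecView v) {
        v.data = data;
        v.start = start + ld * j;
        v.len = rows;
        v.inc = 1;
    }

    // Alias v to row i; consecutive row elements are ld slots apart.
    void rowAt(int i, VecView v) {
        v.data = data;
        v.start = start + i;
        v.len = cols;
        v.inc = ld;
    }

    public static void main(String[] args) {
        double[] storage = new double[9];
        for (int k = 0; k < 9; k++) {
            storage[k] = k;
        }
        MatView a = new MatView(storage, 0, 3, 3, 3);
        VecView v = new VecView();   // allocated once, reused for every view
        a.colAt(1, v);
        v.assignAt(42.0, 0);         // writes through to the matrix storage
        System.out.println(storage[3]);
        a.rowAt(2, v);
        System.out.println(v.eltAt(1));
    }
}
```

Because the view only holds a reference and a few integers, re-aliasing the same VecView to another row or column costs a handful of assignments.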

Boisvert et al. discuss an implementation for numerical libraries in Java
that does not encapsulate vectors and matrices. They use two-dimensional arrays 
to represent matrices, and store information describing the shape of vectors and 
matrices in local variables, similar to the Fortran version. This approach requires 
several versions of each vector operation. One version must handle the case where 
a vector is stored in a one-dimensional array, and another must handle the case 
where a vector is a column of a matrix, and is stored in a two-dimensional array. 
They claim (p. 41): “If we are to provide the same level of functionality as
the Fortran and C BLAS then we must provide several versions of each vector 
operation.” While this may be true of implementations of BLAS primitives, this 
should not affect the interface visible to the programmer. Our shape abstraction 
unifies and encapsulates these various cases. For efficiency, an implementation 
can still provide specialized routines for common cases.

3.2 Limiting Constructor Calls

Excessive object creation is a well-known source of performance loss in
object-oriented programs. Therefore, we use a technique (similar to that
described by Dingle and Hildebrandt) to limit the number of temporary vectors
and matrices created. Such objects are used locally in methods of the JBLAS and
JLAPACK classes, so it would be natural to allocate them within these methods.
Instead, we make them private static class members, so that each is created
once and reused across calls. Note that this approach works only
because none of the methods in the library are recursive and because we are
ignoring issues of thread safety.
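
The idiom can be sketched as follows. This is an assumption-laden illustration rather than JLAPACK code: the class ScratchReuseSketch, its reverse method, and the allocations counter are hypothetical, and a plain array stands in for the temporary vector and matrix objects the text describes.

```java
import java.util.Arrays;

// Hypothetical sketch of the idiom; a plain array stands in for the
// temporary vector and matrix objects used by the library.
final class ScratchReuseSketch {
    // Held in a private static field: allocated once and grown on demand,
    // instead of once per call. Not thread-safe and not reentrant.
    private static double[] scratch = new double[0];
    static int allocations = 0;   // instrumentation for the demo only

    // Reverses x in place, using the shared scratch buffer as a temporary.
    static void reverse(double[] x) {
        if (scratch.length < x.length) {
            scratch = new double[x.length];
            allocations++;
        }
        System.arraycopy(x, 0, scratch, 0, x.length);
        for (int i = 0; i < x.length; i++) {
            x[i] = scratch[x.length - 1 - i];
        }
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {4, 5, 6};
        reverse(a);
        reverse(b);   // reuses the buffer allocated by the first call
        System.out.println(Arrays.toString(b) + " allocations=" + allocations);
    }
}
```

The two caveats in the text are visible here: a recursive or concurrent caller would overwrite the shared buffer while it is still in use.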




Fig. 1. Sharing of data among multiple matrices and vectors. The 4x4 matrix A uses 
all 16 elements of its data array. Matrix B is assigned to be a submatrix of A. It shares 
the same data object as A, but only uses 4 elements of the array. Vector C represents 
one row of matrix B. Again, it shares the data object with A and B, but only uses two 
elements. Note how the shape parameters specify exactly where the data is stored.

3.3 Method Granularity

Most of the work in the BLAS routines involves looping through columns of a
matrix, accessing and modifying elements. An example is the scale routine, which
scales a vector by a constant factor. A natural implementation would perform
an eltAt() and an assignAt() call for each element in the vector. Unfortunately,
every call to eltAt() and assignAt() must use the shape object to calculate the
address of an element. The vector and matrix access equations above show the
cost of these calculations. Boisvert et al. observe that the use of such methods
is five times slower than an ordinary array access. We employ two mechanisms to
overcome this overhead: aggregate operations and incremental access methods.

Aggregate operations are operations performed on an entire vector or matrix 
at once. We converted operations such as the scale operation into methods in 
the vector and matrix classes. These methods exploit the bulk nature of the 
updates to access successive elements using incremental address computations. 
The calculation of the index into the data array consists only of an increment, 
instead of the multiplication and addition performed in the eltAt() method.
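
The difference between the two styles can be sketched as below. The names are hypothetical and the shape fields are passed as plain parameters; the sketch only contrasts per-element index computation with an incremental one and makes no claim about the actual JLASTRUCT code.

```java
// Hypothetical sketch contrasting per-element and aggregate scaling.
final class ScaleSketch {
    // Element-wise style: every access recomputes start + i * inc.
    static void scaleByElement(double[] data, int start, int len, int inc, double alpha) {
        for (int i = 0; i < len; i++) {
            int slot = start + i * inc;      // multiply and add per element
            data[slot] = alpha * data[slot];
        }
    }

    // Aggregate style: the slot is carried along and advanced by inc.
    static void scaleAggregate(double[] data, int start, int len, int inc, double alpha) {
        int slot = start;
        for (int i = 0; i < len; i++, slot += inc) {
            data[slot] *= alpha;             // index update is a single addition
        }
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6};
        scaleAggregate(data, 1, 3, 2, 10.0); // scales slots 1, 3, and 5
        System.out.println(java.util.Arrays.toString(data));
    }
}
```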

Another common type of operation in the library is to loop over a vector, 
accessing but not modifying its elements. Because the elements are being used 
instead of being modified, aggregate methods do not apply. To limit the number 
of index calculations, we include incremental access methods. These methods are 
used to retrieve the next element of a vector or the next column of a matrix, 
and are similar to the methods defined by the java.util.Enumeration
interface. However, Enumeration does not handle primitive types, so we could
not implement this functionality with the Enumeration interface.
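
An incremental access method of this kind might look like the sketch below. The class DoubleCursor is a hypothetical name; its method names merely mirror java.util.Enumeration, and it returns the primitive type double directly instead of an Object.

```java
import java.util.NoSuchElementException;

// Hypothetical Enumeration-like cursor over the elements of a strided vector.
final class DoubleCursor {
    private final double[] data;
    private final int inc;
    private int slot;        // slot of the next element to return
    private int remaining;   // number of elements not yet returned

    DoubleCursor(double[] data, int start, int len, int inc) {
        this.data = data;
        this.inc = inc;
        this.slot = start;
        this.remaining = len;
    }

    boolean hasMoreElements() {
        return remaining > 0;
    }

    // Returns a primitive double; the index advances by one addition.
    double nextElement() {
        if (remaining <= 0) {
            throw new NoSuchElementException();
        }
        double value = data[slot];
        slot += inc;
        remaining--;
        return value;
    }

    // Example read-only traversal: sum of all remaining elements.
    static double sum(DoubleCursor c) {
        double s = 0.0;
        while (c.hasMoreElements()) {
            s += c.nextElement();
        }
        return s;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6};
        System.out.println(sum(new DoubleCursor(data, 0, 3, 2)));
    }
}
```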



3.4 Complex Numbers 

Currently, Java does not provide a primitive type for complex numbers. However, 
complex numbers are required within the LAPACK library, so we provide two
implementations for them. The first approach is to use a class JLASTRUCT.Complex,
encapsulating complex values and arithmetic operations on them. While this is 
object-oriented, the overhead of using many small objects and calling a method 
for every arithmetic operation makes this approach unusably slow. 

Our second implementation of complex numbers simply inlines them, by 
making the data arrays of the vector and matrix classes twice as long, and storing 
the real and imaginary components contiguously in the array. Access methods 
change from eltAt() to realAt() and imgAt(), and all arithmetic is performed
inline. While this is an unattractive approach to dealing with complex numbers, 
it demonstrates the performance achievable with a primitive complex type. 
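
The inlined representation can be sketched as follows. The class ComplexVecSketch is hypothetical and assumes a unit-stride vector starting at slot 0; only the method names realAt() and imgAt() and the interleaved real/imaginary layout come from the text.

```java
// Hypothetical sketch of a complex vector with inlined components.
final class ComplexVecSketch {
    final double[] data;   // 2 * len doubles: re0, im0, re1, im1, ...
    final int len;

    ComplexVecSketch(int len) {
        this.len = len;
        this.data = new double[2 * len];
    }

    double realAt(int i) { return data[2 * i]; }
    double imgAt(int i)  { return data[2 * i + 1]; }

    void assignAt(double re, double im, int i) {
        data[2 * i] = re;
        data[2 * i + 1] = im;
    }

    // Multiplies every element by (ar + i*ai); no Complex objects are created.
    void scale(double ar, double ai) {
        for (int k = 0; k < 2 * len; k += 2) {
            double re = data[k];
            double im = data[k + 1];
            data[k] = ar * re - ai * im;
            data[k + 1] = ar * im + ai * re;
        }
    }

    public static void main(String[] args) {
        ComplexVecSketch v = new ComplexVecSketch(2);
        v.assignAt(1.0, 2.0, 0);
        v.assignAt(3.0, -1.0, 1);
        v.scale(0.0, 1.0);   // multiply by the imaginary unit
        System.out.println(v.realAt(0) + " " + v.imgAt(0));
    }
}
```

The arithmetic is written out by hand in scale(), which is exactly the readability cost that the discussion of manual inlining refers to.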



3.5 Discussion 

We discuss certain aspects of Java that make the development of JLAPACK 
difficult, and how we address them.

Two language issues hinder the development of JLAPACK: the absence of
parametric polymorphism and the absence of operator overloading. The absence
of parametric polymorphism required us to create a version of the JLAPACK
library for each data type, which results in code bloat and extra programmer
effort. Several projects have examined methods for providing parametric
polymorphism, either by modifying the JVM or by adding a preprocessing
phase, and it is possible that the feature will be available in future versions of
Java.

The lack of operator overloading required us to write many methods in unna- 
tural forms. For example, the colAt() method intuitively should return a Vector 
object. Because we could not overload the assignment operator, we had to pass 
in the Vector object as a parameter to the method. Likewise, we had to write 
out in full detail mathematical operations such as scaling of vectors, instead of 
using a more natural mnemonic form, such as the *= operator. 

It is true that neither of these language features is fundamental, and that 
both represent “syntactic sugar” that would be removed in a preprocessing step. 
We ignored these issues while implementing JLAPACK, as our goal was to test 
our hypothesis about performance. However, the general user does not want to 
deal with such issues and is less apt to use a library that has such unnatural 
syntax. (Witness the success of Matlab, which virtually removes the difference 
between the linear algebraic representation of an algorithm and its realization 
in code.) We feel that Java will not be attractive to the numerical computing 
community until these features are integrated into the language. 

Our results document the overhead of encapsulating complex numbers in 
classes. Manual inlining is not the correct solution either, as it detracts from the 
readability of the code, replicates common operations, and presents a common 
source of bugs. While it is beyond the scope of this paper to determine the best 
mechanism for including primitive complex numbers in Java, this issue is under 
consideration by the Java Grande Forum, and must be resolved satisfactorily
if Java is to be viable for numerical computing. 

4 Performance 

Performance is an overarching concern for scientific computation. The Fortran 
version of LAPACK has been highly optimized and represents our target level of 
performance. Therefore, we compare JLAPACK with the optimized Fortran version
(compiled on the test platform with the vendor’s optimizing Fortran 77 compiler
from the LAPACK 2.0 source distribution downloaded from www.netlib.org)
in all our results. In this section, we present the results from our experiments 
and discuss the reasons for both good and poor performance. 

We present performance results for solving the system of linear equations 
AX = B, using a coefficient matrix A and a right hand side matrix B whose 
entries are generated using a pseudorandom number generator from a uniform 
distribution in the range [0,1]. The same seeds are used in both the Fortran 
and Java versions, to guarantee that both versions solve identical problems. The
square matrix A has between 10 and 1000 columns. The matrix B has from 1
to 50 columns. In every case, the leading dimension of the matrix equals the 
number of rows of the matrix. We separately timed the triangular factorization 
(xGETRF) and triangular solution (xGETRS) stages. The two data types used in 
timing were double precision real numbers (x=D) and double precision complex 
numbers (x=Z). For the factorization stage, we used block sizes between 1 and 
64. 

Table 1 lists the platforms we used for timing. We ran Fortran versions for
all Unix platforms, using the -fast option when compiling the Fortran library.
On the DEC, where native BLAS libraries were available through the dxml
library, we measured performance with both the JBLAS classes and the
native library. On the Sparcs, we ran two versions with kaffe: one with
dynamic array bounds checking turned on and the other with this feature turned
off. We turned off array bounds checking in kaffe by modifying the native
instructions that its JIT compiler emits. We measured performance without
array bounds checking for two reasons. First, we wanted to quantify the cost
of performing bounds checks. Second, global analysis of our code could prove
that instances of java.lang.ArrayIndexOutOfBoundsException could never
be thrown. While this cannot always be determined from the structure of the
program, and no current implementation of the JVM systematically eliminates
runtime bounds checking in this manner, such an optimization is likely to appear
in future generations of JVM implementations.

We manually compensated for certain deficiencies in javac to boost the per- 
formance of our code. The primary modification was loop unrolling. In our ex- 
periments, an unrolling depth of four gave the best performance. Unrolling does 
introduce a cost in code size. Unrolling loops in the JLASTRUCT. Vector class by 
factors of two, four, and eight increased class file sizes by 41%, 62%, and 104%. 
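
Manual unrolling by a depth of four has the general shape sketched below. The routine is a hypothetical AXPY-style update chosen for illustration; it is not taken from the JLASTRUCT.Vector class.

```java
// Hypothetical sketch of manual loop unrolling by a depth of four.
final class UnrollSketch {
    // Computes y[i] += alpha * x[i] for all i; x and y must have equal length.
    static void axpyUnrolled(double alpha, double[] x, double[] y) {
        int n = x.length;
        int limit = n - (n % 4);   // largest multiple of four not exceeding n
        int i = 0;
        for (; i < limit; i += 4) {
            y[i]     += alpha * x[i];
            y[i + 1] += alpha * x[i + 1];
            y[i + 2] += alpha * x[i + 2];
            y[i + 3] += alpha * x[i + 3];
        }
        for (; i < n; i++) {       // cleanup loop for the remaining elements
            y[i] += alpha * x[i];
        }
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6};
        double[] y = new double[6];
        axpyUnrolled(2.0, x, y);
        System.out.println(java.util.Arrays.toString(y));
    }
}
```

The repeated loop body and the extra cleanup loop are the source of the class file growth reported above.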

Table 2(a)-(d) presents performance results for the three platforms listed in
Table 1, including the test cases on the DEC using the native BLAS library.
Analysis of the results reveals several interesting facts. First, the Java version
with bounds checking enabled and inlined complex numbers performs within
a factor of four of the Fortran version for certain architectures and problem
sizes. On the SPARCstation 5, the Java version is about three or four times
slower than the Fortran version on the larger problem sizes for both the
factorization and the triangular solve. As a side note, the interpreted Java
implementation was unusably slow.

Table 2. Performance results for double precision real (D) and double precision
complex (Z) values. Entries represent the ratio of the JLAPACK running time to the
LAPACK running time (lower is better). Results for the complex version that uses
inlined complex numbers are denoted by (I), and results for the version that used classes
for complex numbers are denoted by (C). The results for the triangular factorization
without blocking are denoted by F(nb), the results for the triangular factorization with
a blocking factor of 16 are denoted by F(b), and the results for the solve are denoted
by S. The label bc denotes that bounds checking was enabled, and nbc denotes that it
was disabled. The label r indicates a small matrix (100 by 100) was used so that the
program could take advantage of caching. The label R indicates a large matrix (600 by
600) that could not fit into the system cache was used. A — label denotes a missing
entry. (a) Performance on a SPARCstation 5. (b) Performance on an UltraSparc 17.
(c) Performance on a DEC Personal Workstation. (d) Performance of Native BLAS on
a DEC Personal Workstation.

Second, on the UltraSparc, for most of the cases with bounds checking
enabled and inlined complex numbers, there is less than a factor of seven difference
between the two versions. However, for the factorization with double precision
numbers and blocking, the Fortran version performs about eleven times better
than the Java version. This is because blocking significantly improves the
performance of the Fortran version, but not of the Java version. Our hypothesis is
that the variations in performance represent instruction scheduling effects. We
examined the assembly code generated by the Fortran compiler on the
SPARCstation 5 and on the UltraSparc, which represent different implementations of
the same instruction set architecture. The code generated for the inner loops
of several routines varied considerably, using different degrees of loop unrolling
and different schedules. The kaffe JIT compiler generated identical instruction
sequences for both platforms. We believe that the sub-optimal instruction
schedule increases pipeline stalls and nullifies the improvements in spatial locality
due to blocking.

Third, the native BLAS library made a significant impact on performance, 
especially for the cases where blocking was used. Because LAPACK heavily 
relies on BLAS for its computations, using the native BLAS library brought the 
performance of JLAPACK close to the performance of LAPACK (within 15% 
for large problem sizes). This demonstrates that the object-oriented wrappers 
provided by JLAPACK were efficient. It also supports our hypothesis that poor 
instruction scheduling hurt performance in the pure Java version. 

Fourth, the impact of bounds checking is shown by the data generated on
the Sparcs. For the test cases, removing bounds checking increased performance
by 10% to 25%. The effect was slightly larger for the UltraSparc than the
SPARCstation, and slightly larger for the solution stage than the factorization stage.

Finally, using classes to represent complex numbers performs very poorly. On 
all the platforms tested, the version that uses the Complex class is more than 
twice as slow as the version that inlined complex numbers. 

5 Related Work 

Several other projects investigate Java for numerical computing. The Java
Numerical Toolkit is a set of libraries for numerical computing in Java. Its
initial version contains functionality such as elementary matrix and vector
operations, matrix factorization, and the solution of linear systems. HPJava [20] is
an extension to Java that allows parallel programming. HPJava is somewhat
similar to HPF and is designed for SPMD programming.

Several projects are developing optimizers for Java. Moreira et al. are
developing a static compiler that optimizes array bounds checks and null pointer
checks within loops. Adl-Tabatabai et al. have developed a JIT compiler that
performs a set of optimizations, including subexpression elimination, register
allocation, and the elimination of array bounds checking. Such optimizations
may allow us to bridge the performance gap between our version with bounds
checking and our version without bounds checking.

6 Conclusions and Future Work 

Portability, security, and ease of use make Java an attractive programming
environment for software development. Performance problems and the absence of
several language features have hindered its use in high-performance numerical
computing. While operator overloading and parametric polymorphism are
indeed “syntactic sugar”, they will contribute significantly to the usability of the
language and to the willingness of the numerical computing community to use
Java. We have quantified the difference between using a primitive type for
complex numbers, which we have simulated, and using a class for complex numbers.
As expected, there is strong evidence that a primitive type is needed.

Future work in the development of high-performance object-oriented nume- 
rical libraries in Java can be divided into the following categories. 

Programming model changes. The algorithms implemented in most numerical 
libraries today were designed for the Fortran programming model. These may 
not be the best algorithms when run under the object model of Java. We have 
discussed several object-oriented programming idioms to implement numerical 
libraries efficiently. Future work needs to explore these and other techniques such 
as expression templates.

Compiler changes. We noted in Section 4 several desirable optimizations that
javac does not perform. Much work remains to be done here to develop better
compilation techniques for Java. Budimlic and Kennedy are exploring such
optimizations using object inlining techniques.

Just-In-Time compilation. Current JIT compilers are early versions
and have not been heavily optimized. As we discussed in Section 4, some do not
take advantage of machine-specific optimizations and do not appear to schedule
code effectively.

Architectural issues. Current trends in processor implementation add
significant instruction re-ordering capabilities to the hardware. Engler conjectures
that this may reduce or obviate the need for instruction scheduling by JIT
compilers. This is a reasonable conjecture that needs to be tested.

Experimentation with other codes. LAPACK is obviously not representative 
of all numerical software. Further work needs to be done to determine if Java 
implementations of other numerical software behave similarly. 

Our results show that Java may perform well enough to be used for numerical 
computing, if a handful of concerns about language features and compilation 
strategies are adequately addressed. While we have not yet met the goal of having 
Java perform as well as Fortran, we are beginning to get reasonably close. We 
speculate that a combination of techniques will narrow this gap considerably over 
the next few years, and that Java will be the language of choice for numerical 
computing by the year 2000.

Abstract. We have used the Illinois Concert C++ system (which supports
dynamic, object-based parallelism) to parallelize a flexible adaptive
mesh refinement code for the Cosmology NSF Grand Challenge.
Our goal is to enable programmers of large-scale numerical applications
to build complex applications with irregular structure using a high-level 
interface. The key elements are an aggressive optimizing compiler and 
runtime system support that harnesses the performance of the SGI-Cray 
Origin 2000 shared memory architecture. We have developed a configura- 
ble runtime system and a flexible Structured Adaptive Mesh Refinement 
(SAMR) application that runs with good performance. We describe the 
programming of SAMR using the Illinois Concert System, which is a 
concurrent object-oriented parallel programming interface, documenting 
the modest parallelization effort. We obtain good performance of up to 
24.4 speedup on 32 processors of the Origin 2000. We also present results 
addressing the effect of virtual machine configuration and parallel grain 
size on performance. Our study characterizes the SAMR application and 
how our programming system design assists in parallelizing dynamic co- 
des using high-level programming. 



1 Introduction 

The challenges of parallel programming include load balancing, data distribution, 
and coordination of communication between separate streams of execution. In 
general, all three of these issues must be addressed in order to obtain the greatest 
performance benefit. Hardware cache-coherent shared memory multiprocessors 
offer a shared virtual address space with caches that are kept coherent automati- 
cally by hardware. This supports parallel programming by reducing the latency 
of remote memory access, making communication costs less critical. In prac- 
tice, inter-processor communication costs remain a key factor in determining 
performance. The latency of remote memory access due to cache misses is still 
high when compared to processor clock cycles and can greatly hamper parallel 
efficiency.

Shared memory multiprocessors are being scaled up to 512 or even 1024
processors. Studies are needed to determine what runtime system support can utilize
this trend and what applications are enabled as a result. Structured Adaptive 
Mesh Refinement (SAMR) methods are an important class of numerical codes 
and are a target for such investigations because they contain dynamic paralle- 
lism and are difficult to program using a low-level, message passing programming 
paradigm. 

Our research addresses the use of large-scale shared memory systems, utili- 
zing a highly efficient runtime system to support application parallelization. Our 
runtime system allows programs to utilize varying degrees of shared memory sup- 
port. To demonstrate our technology, we have parallelized a large-scale SAMR 
application, a numerical simulation method being used for the NSF Cosmology 
Grand Challenge.

SAMR is a technique for simulating a discrete model of space. It works by 
solving hyperbolic partial differential equations numerically through a series of 
discrete timesteps. SAMR methods recognize that a large portion of this mo- 
deling area is often empty and/or constant throughout the simulation. SAMR 
uses a dynamic hierarchy of meshes, only creating high-resolution meshes where 
heuristics deem that they are needed. This approach saves computation while 
still providing high resolution modeling. SAMR has been successfully applied 
to a number of important problems and is gaining acceptance in the realm 
of scientific computation. Programming SAMR is more complex than single or 
multi-grid simulations, and this has prevented it from becoming more widely 
used. Data structures (i.e. meshes) in SAMR are created and deleted from the 
hierarchy as appropriate to achieve desired solution precision. Therefore, the 
parallelism across meshes is dynamic, and is not analyzable during compilation 
or even initialization of the program. SAMR’s dynamic parallelism must be ex- 
ploited during the run of the program, making parallelization of the method a 
challenge. 

The Illinois Concert System [2] and the ICC++ language together form a
parallel programming environment geared towards tackling dynamically parallel
codes. This system provides a variety of support for dynamic parallelism and
distributed parallel data structures: a global namespace, efficient fine-grained
threads, orthogonal data distribution, and high performance runtime primitives.
These capabilities make it possible to express dynamically parallel applications
with modest effort, and preserve program flexibility for application tuning or
algorithm improvement. The ICC++ parallel language is based on the
object-oriented language C++. It provides simple syntactic language extensions to
express parallel blocks and loops in code, and a distributed data type for concurrent
object distribution.

This paper is a case study for parallelizing a dynamic application on
large-scale shared memory using a high-level object-oriented programming
environment. The contributions of this work are to show high absolute performance
using a high-level parallel programming system, and to explore configurations
of our runtime system, gauging how well they exploit the underlying hardware.

In our SAMR implementation the meshes are tiled, or partitioned into
independent pieces, in order to control granularity of parallelism. Our performance
results show that a modest parallel grain size of 50x50 tiled meshes is best to
provide ample parallelism and minimize the overhead of concurrent execution. We
achieve up to 24.4 times speedup on a 32 processor Origin 2000 using
maximum shared memory support for communication, which is 93% of the maximum
feasible speedup for our application. In addition, we achieve 11.7 times speedup
on 16 processors for our application by relying heavily on our high-performance
messaging support and simple additional techniques. Both of these performance
results are promising, and demonstrate the usefulness of our configurable
runtime system on shared memory machines. The good performance shows that
high-level parallel programming with the Illinois Concert System is viable and
can aid greatly in the capture of dynamically parallel applications.

The rest of this paper is structured as follows. Section 2 provides
background on concurrent object-oriented programming and SAMR methods. Section 3
describes our SAMR implementation in detail. Section 4 discusses the Concert
system, with particular attention to runtime system capabilities. Section 5 shows
detailed performance results, and several conclusions that we have drawn.
Section 6 highlights some related work. Section 7 summarizes our study, and briefly
discusses future work.



2 Background 

Concurrent Object-Oriented Programming (COOP) builds on the idea of objects 
for encapsulation in sequential programs, extending it for concurrent programs. 
Objects in COOP are seen as autonomous communicating entities. This view 
maintains the benefits of sequential object-orientation by providing reuse and 
modularity. It also allows for flexible concurrency, where objects can be distri- 
buted and operated upon either explicitly by the programmer or automatically 
by a compiler or runtime system. 

The Illinois Concert System is a programming environment that harnesses
the benefits of COOP, with a goal of high performance. It consists of the ICC++
language, the Concert compiler, and the Concert runtime system.
Concert supports fine-grained, concurrent object-oriented programming on Actors.
Computation is expressed as method invocations on objects or collections of
objects. Concurrent method invocations operate against state stored in
dynamically created thread data structures. Synchronization of concurrent threads is
handled automatically by the system, and the user only needs to be concerned
with specifying concurrent parts of the code and not concurrency management.
The syntactic similarity of ICC++ to C++ eases the annotation of sequential
C++ applications for concurrency.

In addition to common features of sequential object-oriented languages such
as object encapsulation and inheritance, three features of the programming
model support programming dynamic parallel applications. A shared name space
allows programmers to build sophisticated distributed data structures without
explicit name management. Implicit dynamic thread creation frees programmers
from explicit thread and synchronization management. Object-level concurrency
control maintains sequential consistency of the object state in the global
namespace, freeing the programmer from managing explicit locking.

The Concert system implementation is a state-of-the-art
implementation of a COOP programming model. A full discussion of the Concert
runtime system can be found elsewhere. The Concert compiler implements a
number of aggressive, inter-procedural optimizations that achieve high
performance for sequential object-oriented codes. An implementation of Fast Messages
is used for low-overhead communication between address spaces.

Adaptive mesh methods are a class of finite difference methods that provide
high modeling resolution and computational efficiency by generating a hierarchy
of grids, comprised of a series of levels. Each level contains a list of grids, with
grids on a lower level modeling progressively smaller physical areas of space.
Grids at lower levels derived from higher-level grids are said to be subgrids or
child grids, with the higher-level grid acting as the child grids’ parent.
Furthermore, the hierarchy is dynamic in that grids can be created and deleted from
the hierarchy as the simulation proceeds. This framework sets the stage for
adaptive methods and provides the benefits of computational efficiency as well as
arbitrary resolution.

3 SAMR Implementation 

Our study is based on a sequential SAMR method written by scientists at NCSA
in the Computational Cosmology group. It has been used in a number of
cosmology experiments. It is written in C++ and uses Fortran 77 kernels to perform
the compute-intensive interpolation and partial differential equation solves.

3.1 Code Structure 

The code is structured as a series of phases. Each phase is either an inter-mesh
communication phase or is mesh independent. An example of a mesh independent 
phase is the partial differential equation solve, and an example of a communica- 
tion phase is the mesh-to-mesh boundary copy phase. The code first generates 
an initial hierarchy of meshes to model the physical space, and then iterates over 
the phases, for each timestep dt, at each level of the hierarchy. 

After each iteration of the phases, the entire hierarchy is regridded from the 
current level down. The first three of the seven phases are mesh independent. The 
rest of the phases (including the regridding) involve inter-mesh communication. 
Hierarchies generated by our code typically consist of hundreds of 2D grids each 
ranging in size from 100 to 1000 cellpoints in each dimension. 

3.2 Code Parallelization 

Each of the phases listed above contains parallelism over meshes, or pairs of
meshes at a level. Dependences in our particular implementation force us to
exploit only intra-level parallelism. Even so, almost the entire method
can be parallelized.

Current popular parallelization approaches struggle to capture parallelism 
over dynamic data structures created during the run of a program. Explicit 
message passing systems such as MPI force users to manually track the location 
of every dynamically created data structure. This is not a good match for the 
SAMR method, which creates an entirely new hierarchy of possibly hundreds of 
new grids at every iteration. Compiler-based approaches such as HPF need
compile-time knowledge to load balance such an application, and are usually
unable to glean such information through analysis. We have also explored the
use of thread-based shared memory to parallelize the code. In brief, these efforts
showed that the support that exists on shared memory machines lacks the
flexibility and robustness to parallelize codes that are as fine-grained as our SAMR
implementation without a great deal of additional programming effort.

ICC++ and Concert allow the user to annotate dynamic concurrency in a
clear, simple way and manage the concurrent execution automatically. By
making each mesh an object, we express each parallel phase as a series of concurrent
method calls as follows:

class Grid {
public:
    void Solve();
};

conc for (all grids G at this level)   // Solve loop
    G.Solve();

The ICC++ conc annotation in the for loop above annotates the loop as
concurrent. All of the calls to the Solve routine execute independently, and
synchronize barrier-style at the end of the loop.

3.3 Tiling 

There are inherent problems with a simplistic approach of distributing meshes for 
concurrency. For load balance, mesh-based distribution depends on the algorithm 
to provide parallelism. For example, if the method’s heuristics choose to create 
only three grids at a particular level, and we are running on four processors, 
there is no way to achieve good load balance. 

To address this problem, we have employed a technique called tiling. In
tiling, we use a source code transformation to partition each mesh into
equal-sized parts. Our original grids become distributed arrays of tiles, and the tiles
are distributed among processors. We have ensured that this transformation is
transparent, and produces the same results as the untiled simulation. Tiling
gives us controllable, uniform granularity, so that data structures can be
distributed evenly across address spaces. The cost of the transformation is increased
communication, as each tile has its own boundary region. We shall see that for
certain tile sizes, the load balance benefits of tiling far outweigh the additional
communication cost in terms of parallel performance.
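
The trade-off described above can be illustrated with a small C++ sketch. It is not the source transformation used in the paper: the names Tile, make_tiles, and halo_cells are hypothetical, the boundary region is assumed to be one cell wide, and edge tiles are allowed to be smaller when the tile size does not divide the mesh.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One tile of a 2D mesh: the half-open index ranges it owns.
struct Tile {
    std::size_t row_begin, row_end;
    std::size_t col_begin, col_end;
};

// Partition an n_rows x n_cols mesh into tiles of at most tile x tile cells.
// Hypothetical helper; tiles on the mesh edge may be smaller.
std::vector<Tile> make_tiles(std::size_t n_rows, std::size_t n_cols, std::size_t tile) {
    assert(tile > 0);
    std::vector<Tile> tiles;
    for (std::size_t r = 0; r < n_rows; r += tile)
        for (std::size_t c = 0; c < n_cols; c += tile)
            tiles.push_back({r, std::min(r + tile, n_rows),
                             c, std::min(c + tile, n_cols)});
    return tiles;
}

// Boundary cells surrounding a tile, assuming a one-cell-wide boundary region.
std::size_t halo_cells(const Tile& t) {
    const std::size_t h = t.row_end - t.row_begin;
    const std::size_t w = t.col_end - t.col_begin;
    return 2 * (h + w) + 4;
}
```

Under these assumptions a 100x100 mesh yields 4 tiles of 50x50 with 204 boundary cells each, or 100 tiles of 10x10 with 44 boundary cells each: more parallel grains, but several times more boundary data in total.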



4 Concert Runtime Configuration 

This section will describe the Concert runtime system configurations on shared- 
memory machines. The Concert runtime logically distributes the underlying me- 
mory into address spaces over which a program’s objects are distributed for con- 
current execution. The address spaces each contain heavyweight threads (e.g. 
Unix processes) to execute lightweight logical Concert threads. A full descrip- 
tion of the system can be found in COl. We have developed a configurable version 
of this system for the shared memory architecture. 



Fig. 1. Runtime configurations. Configuration 1: single worker per address space.
Configuration 2: multiple workers per address space.



The Concert runtime implementation on the shared memory SGI-Cray Origin
2000 machine is architected to take advantage of the hardware shared memory
support. The system is configurable to utilize a range of levels of shared memory
support versus messaging support for communication. This is accomplished with
multiple system threads in each address space. These threads share a work queue
(see Fig. 1). Assuming an even distribution of objects, address spaces in
Configuration 1 tend to contain fewer objects each than in Configuration 2 for a given
number of object instances in a program. Thus sharing in Configuration 2 utilizes
more shared memory support, at a cost of decreased locality and processor-data
affinity. This flexibility in design allows applications to choose a degree of
utilization of shared memory support. We can enumerate a range of configurations,
including pure shared memory, hybrid models with multiple threads per address
space, and pure distributed memory. In general, these configurations are listed
from least to most messaging support use, and from least to most processor-data
affinity.






5 Performance Results 

Our investigation evaluates the use of high-level programming to get performance
from the target application. Related to this, we identify two issues that we
explore with our performance experiments. Granularity versus available
parallelism: the tiling transformation allows us to tune our application to run with a
range of parallel grain sizes and available parallelism. We vary the tiling
parameter and measure the effects on performance. Runtime configuration: varying
the runtime configuration allows us
to compare the benefits of shared memory utilization to those of increased data
locality. The overall goal is to evaluate system performance by varying the way
in which we utilize the underlying shared memory hardware. We also seek a high
level of absolute performance for the SAMR code. All experiments were run on
the NCSA Origin 2000, on configurations of up to 32 processors with 12 GB of
total physical memory running IRIX version 6.4.

The input data set for our experiments is a two-dimensional ShockTube
simulation. This test case can be run either adaptively or as a non-adaptive,
single-mesh simulation. The size of the input data set is roughly governed by the size
of the top-level mesh in the hierarchy, which is specified as a program parameter.
For a complete description of test cases and a full set of results, see [12].

We benchmarked the ICC++ code against the C++ code compiled using the
SGI CC compiler with full (-O3) optimization and found the ICC++ code with
parallel annotations to be no more than 15% slower. These results validate
parallelization of the code. All parallel speedup results are presented with respect
to the sequential ICC++ code.

5.1 Single Address Space Configuration 

The parallel results shown in this section use a Concert virtual machine with
all system threads in a single address space. This configuration uses hardware
shared memory support for all sharing and communication. Objects can reside
in any processor's cache, and invalidation of cached data can be issued by any
other processor.

Accompanying each speedup graph in this section and Sect. 5.2 is a
corresponding graph of percent maximum feasible speedup. The latter graphs show
the percentage of maximum feasible speedup that each run of the code
achieved. The maximum feasible speedup is less than the ideal speedup because of
sequential portions of the SAMR algorithm and because of the limited available
parallelism with larger tile sizes. For each parallel execution of the code, the
maximum feasible speedup is the sequential execution time divided by the feasible
parallel execution time (FPET).

Fig. 2. Multi-level test case. Panels (a)-(c): speedup for 100x100, 200x200, and
300x300 top grids. Panels (d)-(f): percent of maximum feasible speedup for the
same grids. Horizontal axis: number of processors.

To calculate FPET, we proceeded phase by phase as follows: If a phase is
entirely sequential, add its sequential execution time to the FPET. For parallel
phases, we considered that the number of grains of available parallelism (e.g.
tiles) in a given phase might be less than the number of processors. If a parallel
phase has more tiles than processors in the run, then we divide the sequential
execution time of the phase by the number of processors and add the result
to FPET. Otherwise, we divide the sequential execution time by the maximum
number of tiles involved in the phase and add the result to FPET. We emphasize
that our calculations use the maximum number of tiles operated on by a phase
over the entire run as the number of tiles for that phase. Maximum feasible
speedup is therefore the maximum speedup that can be achieved by each run
of the code, accounting for both sequential portions of the code and lack of
parallelism. It is more accurate than ideal speedup as a comparison point for
evaluating Concert parallelization performance.
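
The FPET rule above can be written down directly. The following C++ sketch follows the phase-by-phase description; the Phase record and the function names are assumptions made for illustration, not part of the authors' tooling.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-phase record used only for this illustration.
struct Phase {
    double seq_time;  // sequential execution time of the phase
    bool parallel;    // false for an entirely sequential phase
    int max_tiles;    // maximum number of tiles the phase operates on over the run
};

// Feasible parallel execution time (FPET) for a given processor count.
double feasible_parallel_time(const std::vector<Phase>& phases, int processors) {
    assert(processors > 0);
    double fpet = 0.0;
    for (const Phase& p : phases) {
        if (!p.parallel) {           // sequential phase: full time is added
            fpet += p.seq_time;
            continue;
        }
        assert(p.max_tiles > 0);
        // More tiles than processors: divide by processors; otherwise by tiles.
        const int ways = p.max_tiles > processors ? processors : p.max_tiles;
        fpet += p.seq_time / ways;
    }
    return fpet;
}

// Maximum feasible speedup: sequential time divided by FPET.
double max_feasible_speedup(const std::vector<Phase>& phases, int processors) {
    double seq = 0.0;
    for (const Phase& p : phases) seq += p.seq_time;
    return seq / feasible_parallel_time(phases, processors);
}
```

For example, with a 10-unit sequential phase, an 80-unit parallel phase over 36 tiles, and a 10-unit parallel phase over 4 tiles, FPET on 8 processors is 10 + 80/8 + 10/4 = 22.5, so the maximum feasible speedup is about 4.4 rather than the ideal 8.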

Figure 2 shows our results for the single address space configuration. The test
case run was the adaptive ShockTube 2D test case with a two-level hierarchy.
The caption for each figure lists the size of the top-level grid that was run, which
approximately determines test case size.

Clearly, tile size affects performance as the code runs on more than four
processors. For the 300x300 test case, we see that both the 10x10 and 20x20 tile
sizes have fallen to less than 70% of feasible speedup on 16 processors. The cost
of synchronization and data movement between tiles is limiting performance.
The 50x50 tile size cases remain steady, achieving 93% of feasible speedup on 32
processors. This last result tells us that we have ample parallelism for up to 32
processors, even for the largest tile size shown here.

Our experiments show that the speedup of the tile-to-tile communication phases
is the poorest, and tends to limit overall parallel efficiency. See for more
details. We do achieve high overall performance, with a high percentage of feasible
speedup. This tells us that the shared memory support has aided in
parallelization and that the runtime system's load balancing strategy works well with this
dynamically parallel code.

5.2 Single Processor Per Address Space Configuration 

For the results of this section, the virtual machine configuration creates one 
system thread per address space. This configuration relies heavily on the runtime 
messaging layer. For these experiments, we have chosen to run only single-grid, 
non-adaptive test cases. 

Our high-level programming interface allows us to program communication
between meshes or tiles as memory copies in our source code, without having to
manipulate message passing. For mesh-to-mesh communication, each individual
value copied begets a message send in the generated code. This results in a large
amount of overhead and synchronization cost between address spaces. In order
to limit and measure the magnitude of this problem, we use a technique called
message aggregation. For the inner copy loop of our communication phase, we
simply divide the number of message sends by a constant factor. We perform
the aggregation through a source code transformation, which increases the sequential
time of the ICC++ code by 20%, and we assume a constant aggregation factor.
The aggregation factor for the experiments shown here is nine, and speedups
are shown with respect to the ICC++ code with aggregation, running on one
processor.
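
The effect of message aggregation on the number of sends can be sketched as follows. This is only an analogy in standard C++: the paper applies the transformation to the ICC++ copy loop, whereas here a Message is simply a vector standing in for one message send, and the aggregate function is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for one message send carrying a batch of boundary values.
using Message = std::vector<double>;

// Pack boundary values into messages of up to `factor` values each.
// With factor == 1 this degenerates to one send per copied value.
std::vector<Message> aggregate(const std::vector<double>& boundary, std::size_t factor) {
    assert(factor > 0);
    std::vector<Message> sends;
    for (std::size_t i = 0; i < boundary.size(); i += factor) {
        const std::size_t end =
            i + factor < boundary.size() ? i + factor : boundary.size();
        sends.emplace_back(boundary.begin() + i, boundary.begin() + end);
    }
    return sends;
}
```

For a 50-value tile boundary, a factor of one produces 50 sends, while the factor of nine used in the experiments produces 6.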

Figure 3 shows the results of our experiments. We can see that tile size still
plays a pivotal role; smaller tiles yield lower performance. Compare the speedup
of the 50x50 tile size with aggregation to that of the 50x50 tile size without. For
the 1000x1000 grid, message aggregation produces good speedups, almost
doubling the speedups without aggregation. Note that both the speedup and
the actual running time of the transformed code are better than those of the original on
more than eight processors. These results show the benefits of using the high
performance messaging with a simple transformation. We achieve close to 90%
of feasible speedup with 16 processors for the 1000x1000 case.

5.3 Summary of Results 

In this section we saw that we achieved the best absolute performance with the
pure shared memory approach and a moderate tile size. This is due mainly to
the advanced hardware support for remote memory access being faster than even
our low-overhead messaging implementation.

Fig. 3. Results for single address space per processor, single level hierarchy.
Panels (a)-(c) and (d)-(f): 400x400, 600x600, and 1000x1000 grids.
Horizontal axis: number of processors.



The multiple address space runs showed us that a simple, constant fac- 
tor source code transformation can make use of high performance messaging 
competitive with expensive shared memory support. The performance improve- 
ment with an aggregation factor of nine indicates that an order of magnitude of 
overhead and synchronization reduction may enable us to use messaging exclusi- 
vely for communication. This factor would presumably be higher if less efficient 
messaging primitives were used. 



6 Related Work 

The Overture project at Los Alamos National Laboratory is also a framework
for writing parallel SAMR methods. It is built atop the A++/P++ parallel
array class library. Overture is a library of C++ classes designed to represent
grids and grid functions. These classes can be instantiated to form a SAMR
implementation with a rich set of operations for dealing with different hierarchy
topologies and boundary conditions. As Overture is a domain-specific framework,
it is restricted to grid-based simulations.

The LPARX parallel system is designed to address the problem of block 
irregular mesh methods and the problems they present. The system is designed 
to parallelize methods such as SAMR efficiently, using a collaboration of user 
and runtime support. LPARX is restricted to regular, non-overlapping grids, 
and as such is more domain-specific than Overture, and does not approach the 
richness of data types of an arbitrary object-based system. 

7 Summary and Future Work 

We achieve a speedup of up to 24.4 on 32 Origin 2000 processors using maximum
shared memory support for communication, which is 93% of maximum feasible
speedup for our application. In addition, we get a speedup of 11.7 on 16 processors for
our application by relying heavily on our high-performance messaging support
and simple additional techniques. From these results, we can see that the
configurable runtime can be used to customize large-scale shared memory machines
to fit specific application behavior, and can therefore potentially be effective on
a large range of programs. These results also show that a modest grain size can
obtain good performance when utilizing high-performance primitives.

In the future, we plan to expand our experiments on the hybrid distributed- 
shared memory model. Our single shared address space results suggest that 
shared memory, in combination with flexible runtime support yields good per- 
formance. Our multi-address space results imply that the approach of creating 
locality regions of memory and passing messages between them can be a power- 
ful systems technology that will scale shared memory machines towards massive 
parallelism.



Abstract. We present a unified approach for building high-performance
numerical linear algebra routines for large classes of dense and sparse
matrices. As with the Standard Template Library, we separate
algorithms from data structures using generic programming techniques. Such
an approach does not hinder high performance; rather, writing
portable high-performance codes is enabled because the performance-critical
code can be isolated from the algorithms and data structures. We
address the performance portability problem for architecture-dependent
algorithms such as matrix-matrix multiply. Recently, code generation
systems, such as PHiPAC [2] and ATLAS, have allowed algorithms
to be tuned to particular architectures. Our approach is to use template
metaprograms to directly express performance-critical,
architecture-dependent sections of code.



1 Introduction 

Traditional basic linear algebra routines require combinatorial numbers of
versions: four precision types (single and double real, single and double complex),
several dense storage types (general, banded, packed), a multitude of sparse
storage types (13 in the Sparse BLAS Standard Proposal), as well as row and
column orientations for each matrix type. A full implementation might require
hundreds of versions of the same routine.^ Further, the performance of codes
such as matrix-matrix multiply is highly sensitive to memory hierarchy
characteristics, so writing portable high-performance codes is even more difficult.
A code generation system on top of C or Fortran has been required to obtain the
flexibility needed for register blocking tailored to a given computer architecture.

In this paper we apply fundamental generic programming approaches used by 
the Standard Template Library (STL) to the domain of numerical linear algebra. 
The resulting library, the Matrix Template Library (MTL) , provides comprehen- 
sive functionality with a few fundamental algorithms, while also achieving high 
performance. We explore the use of template metaprograms in the construction
of the BLAIS kernels, which provide an elegant solution to portable high
performance for matrix-matrix multiply and other blocked codes.

* This work was supported by NSF grants ASC94-22380 and CCR95-02710.

^ It is no wonder the NIST implementation of the Sparse BLAS contains over 10,000
routines and an automatic code generation system.

template <class Row2DIter, class IterX, class IterY>
void matvec::mult(Row2DIter i, Row2DIter iend, IterX x, IterY y) {
  typename Row2DIter::value_type::const_iterator j;
  while (not_at(i, iend)) {
    j = (*i).begin();
    typename IterY::value_type tmp(0);
    while (not_at(j, (*i).end())) {
      tmp += *j * x[j.index()];
      ++j;
    }
    y[i.index()] = tmp;
    ++i;
  }
}

Fig. 1. Simplified example of a generic matrix-vector product

The MTL is in its second generation,^ the first having been described
previously. The current version uses generic programming to a much larger degree
than its predecessor.

2 Generic Programming 

The principal idea behind the STL is that many algorithms can be abstracted 
away from the particular data structures on which they operate. Algorithms 
typically need the abstract functionality of being able to traverse through a data 
structure and access its elements. If data structures provide a standard interface 
for traversal and access, generic algorithms can be mixed and matched with 
data structures (called containers in STL). This interface is realized through the 
iterator (sometimes called a generalized pointer). 

Abstractly, linear algebra operations also consist of traversing through
vectors and matrices. Vector operations fit neatly into the generic programming
approach. The STL already defines several generic algorithms for vectors, such
as inner_product(). Extending these generic algorithms to encompass the rest
of the Level-1 BLAS is a trivial matter.
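
As an illustration of this point, a dot product and an AXPY-style update can be built from the standard std::inner_product and std::transform algorithms. This is a minimal sketch in C++11, not MTL code; the function names dot and axpy are chosen here for illustration only.

```cpp
#include <algorithm>
#include <list>
#include <numeric>
#include <vector>

// s <- x^T y for any pair of iterator ranges (vector, list, raw array, ...).
template <class IterX, class IterY, class T>
T dot(IterX first, IterX last, IterY y, T init) {
    return std::inner_product(first, last, y, init);
}

// y <- alpha * x + y, expressed with std::transform writing back into y.
template <class IterX, class IterY, class T>
void axpy(IterX first, IterX last, IterY y, T alpha) {
    std::transform(first, last, y, y,
                   [alpha](const T& xi, const T& yi) { return alpha * xi + yi; });
}
```

Because the algorithms see only iterators, the same two functions work when x is a std::vector<double> and y is a std::list<double>.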

Matrix operations are slightly more complex, since the elements are arranged
in a 2-dimensional format. The MTL processes matrices as if they are containers
of containers (note that the matrix implementations are typically not actual
containers of containers). The matrix algorithms are coded in terms of iterators
and two-dimensional iterators. A Row2DIter can traverse the rows of a matrix,
and produces a row vector when dereferenced. The iterator for the row vector
can then be used to access the individual matrix elements. The example in Fig. 1
shows how one can write a generic matrix-vector product.
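
The container-of-containers view can be tried out with plain standard containers. In the sketch below a sparse matrix is assumed to be a std::vector of rows, each row a std::vector of (column index, value) pairs; this layout and the name row_mult are illustrative assumptions and differ from the MTL's own iterators, which carry index() members.

```cpp
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// A sparse row stores (column index, value) pairs; a matrix is a container of rows.
typedef std::vector<std::pair<std::size_t, double> > SparseRow;
typedef std::vector<SparseRow> SparseMatrix;

// y <- A * x, written only in terms of a row iterator and per-row element iterators.
template <class RowIter, class VecX, class VecY>
void row_mult(RowIter i, RowIter iend, const VecX& x, VecY& y) {
    typedef typename std::iterator_traits<RowIter>::value_type Row;
    std::size_t row = 0;
    for (; i != iend; ++i, ++row) {
        typename VecY::value_type tmp(0);
        for (typename Row::const_iterator j = i->begin(); j != i->end(); ++j)
            tmp += j->second * x[j->first];   // value * x[column]
        y[row] = tmp;
    }
}
```

The outer loop plays the role of the two-dimensional iterator and the inner loop that of the row-vector iterator; a dense row type exposing the same interface could be substituted without changing row_mult.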

The MTL is available at http://www.lsc.nd.edu/research/mtl/.

Table 1. MTL linear algebra operations

Function Name            Operation

Vector Algorithms
set(x, alpha)            $x_i \leftarrow \alpha$
scale(x, alpha)          $x \leftarrow \alpha x$
s = sum(x)               $s \leftarrow \sum_i x_i$
s = one_norm(x)          $s \leftarrow \sum_i |x_i|$
s = two_norm(x)          $s \leftarrow (\sum_i |x_i|^2)^{1/2}$
s = inf_norm(x)          $s \leftarrow \max_i |x_i|$
i = find_max_abs(x)      $i \leftarrow$ index of $\max_i |x_i|$
s = max(x)               $s \leftarrow \max(x_i)$
s = min(x)               $s \leftarrow \min(x_i)$

Vector Vector
copy(x, y)               $y \leftarrow x$
swap(x, y)               $y \leftrightarrow x$
ele_mult(x, y, z)        $z \leftarrow y \otimes x$
ele_div(x, y, z)         $z \leftarrow y \oslash x$
add(x, y)                $y \leftarrow x + y$
s = dot(x, y)            $s \leftarrow x^T \cdot y$
s = dot_conj(x, y)       $s \leftarrow x^T \cdot \bar{y}$

Matrix Algorithms
set(A, alpha)            $A \leftarrow \alpha$
scale(A, alpha)          $A \leftarrow \alpha A$
set_diag(A, alpha)       $A_{ii} \leftarrow \alpha$
s = one_norm(A)          $s \leftarrow \max_j (\sum_i |a_{ij}|)$
s = inf_norm(A)          $s \leftarrow \max_i (\sum_j |a_{ij}|)$
transpose(A)             $A \leftarrow A^T$

Matrix Vector
mult(A, x, y)            $y \leftarrow A \times x$
mult(A, x, y, z)         $z \leftarrow A \times x + y$
tri_solve(T, x, y)       $y \leftarrow T^{-1} \times x$
rank_one(x, A)           $A \leftarrow x \times x^T + A$
rank_two(x, y, A)        $A \leftarrow x \times y^T + y \times x^T + A$

Matrix Matrix
copy(A, B)               $B \leftarrow A$
add(A, C)                $C \leftarrow A + C$
mult(A, B, C)            $C \leftarrow A \times B$
tri_solve(T, B, C)       $C \leftarrow T^{-1} \times B$
swap(A, B)               $B \leftrightarrow A$
ele_mult(A, B, C)        $C \leftarrow B \otimes A$
mult(A, B, C, E)         $E \leftarrow A \times B + C$



3 MTL Algorithms 

Table 1 lists the principal algorithms covered by the MTL. This list seems sparse,
but a large number of functions are indeed provided through the combination
of the above algorithms with the strided(), scaled(), and trans() adapter
functions. Figure 2 shows how this is done with a matrix-vector multiply and
with a scaled vector assignment.

The unique feature of the MTL is that, for the most part, each of the
algorithms is implemented with just one template function. Just one algorithm is
used whether the matrix is sparse, dense, banded, single precision, double,
complex, etc. From a software maintenance standpoint, the reuse of code gives the
MTL a significant advantage over the BLAS or even other object-oriented
libraries like TNT (which has different algorithms for different matrix
formats).

The generic algorithm code reuse results in the MTL having 10 times fewer
lines of code than the Netlib Fortran BLAS while providing greater functionality
and achieving generally better performance, especially for level 2 and 3
operations. The MTL has 8,284 lines of code for the algorithms and 6,900 lines of code
for dense containers, for a total of 15,184 lines of code. The Fortran BLAS total
154,495 lines of code, an order of magnitude more.

// y <- A^T * alpha x
matvec::mult(trans(A), scaled(x, alpha), strided(y, incy));
// y <- alpha x
vecvec::copy(scaled(x, alpha), y);

Fig. 2. Transpose, scaled, and strided adapters



4 MTL Components 

The MTL defines a set of data structures and other components for representing 
linear algebra objects. An MTL matrix is constructed with layers of components. 
Each layer is a collection of classes that are templated on the lower layer. The 
bottom most layer consists of the numerical types (float, double, etc). The next 
layers consist of 1-D containers followed by 2-D containers. The 2-D containers 
are wrapped up with an orientation, which in turn is wrapped with a shape. A 
complete MTL matrix type typically consists of a templated expression in the 
form 



shape<orientation<twod<oned<num_type> > > >

For example, an upper triangular matrix would be defined as 

triangle < column < dense2D < double >>, upper > 

Some 2-D containers also subsume the 1-D type, such as the contiguous dense2D 
container. 

Matrix Orientation The row and column adapters map the major and minor 
aspects of a matrix to the corresponding row or column. This technique allows 
the same code for data structures to provide both row and column orientations 
of the matrix. 2-D containers must be wrapped up with one of these adapters to 
be used in the MTL algorithms. 
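
The following C++ sketch shows the idea of an orientation adapter over a single storage class. The class and member names (dense2D here is a toy, row_oriented, column_oriented, at, store) are assumptions for illustration and do not reproduce the MTL's actual interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy dense 2-D storage addressed by (major, minor); it has no notion of rows or columns.
class dense2D {
public:
    dense2D(std::size_t n_major, std::size_t n_minor)
        : n_minor_(n_minor), data_(n_major * n_minor, 0.0) {}
    double& at(std::size_t major, std::size_t minor) {
        assert(major * n_minor_ + minor < data_.size());
        return data_[major * n_minor_ + minor];
    }
private:
    std::size_t n_minor_;
    std::vector<double> data_;
};

// Row orientation: rows are the major dimension.
template <class TwoD>
struct row_oriented {
    TwoD store;
    row_oriented(std::size_t rows, std::size_t cols) : store(rows, cols) {}
    double& operator()(std::size_t i, std::size_t j) { return store.at(i, j); }
};

// Column orientation: columns are the major dimension.
template <class TwoD>
struct column_oriented {
    TwoD store;
    column_oriented(std::size_t rows, std::size_t cols) : store(cols, rows) {}
    double& operator()(std::size_t i, std::size_t j) { return store.at(j, i); }
};
```

Both adapters expose the same (i, j) element access, so code written against that interface is unaffected by the choice of orientation, while the storage class itself is written only once.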

Matrix Shape Matrices can be categorized into several shapes: general, upper 
triangular, lower triangular, symmetric, Hermitian, etc. The traditional approach 
to handling the algorithmic differences due to shape is to have a separate function 
for each type. For instance, in the BLAS we have a _GEMV, _SYMV, _TRMV, etc. 
The MTL instead uses different data structures for each shape, with the banded, 
triangle, symmetric, and hermitian matrix adapters. It is the responsibility 
of these adapters to make sure that they work with all of the MTL generic 
algorithms. The MTL philosophy is to use smarter data structures to allow for 
fewer and simpler algorithms. 

5 The High Performance Layer 

We have presented many levels of abstraction, and a set of unified algorithms for
a variety of matrices, but high performance must be achieved. Template-based
programming coupled with modern compilers such as KAI C++ [12] provides
several mechanisms for high performance.
Static Polymorphism The template facilities in C++ allow functions to be
selected at compile-time based on data type. This provides a mechanism for
abstraction which preserves high performance. Dynamic (run-time) dispatch is avoided,
and the template functions can be inlined just as regular functions. This ensures
that the numerous small function calls in the MTL (such as iterator increment
operators) introduce no extra overhead.

Lightweight Object Optimization The generic programming style introduces a
large number of small objects into the code. This incurs a performance
penalty because the presence of a structure can interfere with other optimizations,
including the mapping of the individual data items to registers. This problem
is solved with small object optimization, also known as scalar replacement of
aggregates, which is performed by the KAI C++ compiler.

Automatic Unrolling Modern compilers do a great job of unrolling loops and
scheduling instructions, but typically only for recognizable cases. There are many
ways, especially in C and C++, to interfere with the optimization process. The
abstractions of the MTL are designed to result in code that is easy for the
compiler to optimize. Furthermore, the iterator abstraction makes inter-compiler
portability possible, since it encapsulates how looping is performed.

Algorithmic Blocking The bane of portable high performance numerical linear
algebra is the need to tailor key routines to specific execution environments. For
example, to obtain high performance on a modern microprocessor, an algorithm
must properly exploit the memory hierarchy and pipeline architecture (typically
through careful loop blocking and structuring). Ideally, one would like to
express high performance algorithms in a portable fashion, but there is not enough
expressiveness in languages such as C or Fortran to do so. Recent efforts
(PHiPAC [2], ATLAS) have resorted to going outside the language, i.e., to code
generation systems, in order to gain this kind of flexibility. The Basic Linear
Algebra Instruction Set (BLAIS) is a library specification that takes advantage
of C++ features to express high-performance loop structures at a high level.

5.1 The Basic Linear Algebra Instruction Set (BLAIS) 

The BLAIS specification contains fixed-size algorithms with functionality
equivalent to that of the Level-1, Level-2, and Level-3 BLAS. The BLAIS
routines themselves are implemented using the Fixed Algorithm Size Template
(FAST) library, which contains general purpose fixed-size algorithms equivalent
in functionality to the generic algorithms in the STL. The thin BLAIS
routines map the generic FAST algorithms into fixed-size mathematical operations.
There is no added overhead in the layering because all the function calls are
inlined. Using the FAST library allows the BLAIS routines to be expressed in
a simple and elegant fashion. Note that the intended use of the BLAIS routines
is to carry out the register blocking within a larger algorithm. This means the
BLAIS routines handle only small matrices, and therefore avoid the problem of
excessive code bloat.

int x[4] = {2, 2, 2, 2}, y[4] = {2, 2, 2, 2};

// STL
template <class InIter1, class InIter2, class OutIter, class BinaryOp>
OutIter transform(InIter1 first1, InIter1 last1, InIter2 first2,
                  OutIter result, BinaryOp binary_op);

transform(x, x + 4, y, y, plus<int>());

// FAST
template <int N, class InIter1, class InIter2,
          class OutIter, class BinOp>
OutIter fast::transform(InIter1 first1, cnt<N>, InIter2 first2,
                        OutIter result, BinOp binary_op);

fast::transform(x, cnt<4>(), y, y, plus<int>());

Fig. 3. Example usage of STL and FAST versions of transform()



We describe the FAST algorithms and show how the BLAIS are construc- 
ted from them. We show how the BLAIS can be used as high-level instructions 
(kernels) to handle the register-level blocking in a matrix-matrix product. Expe- 
rimental results show that the performance obtained by our approach can equal 
and even exceed that of vendor-tuned libraries. 



// The general case
template <int N, class InIter1, class InIter2,
          class OutIter, class BinOp>
inline OutIter
fast::transform(InIter1 first1, cnt<N>, InIter2 first2,
                OutIter result, BinOp binary_op) {
  *result = binary_op(*first1, *first2);
  return transform(++first1, cnt<N-1>(), ++first2,
                   ++result, binary_op);
}

// The N = 0 case to stop template recursion
template <class InItr1, class InItr2, class OutItr, class BinOp>
inline OutItr
fast::transform(InItr1 first1, cnt<0>, InItr2 first2,
                OutItr result, BinOp binary_op) {
  return result;
}

Fig. 4. Definition of FAST transform()



Fixed Algorithm Size Template (FAST) Library The FAST Library includes
generic algorithms such as transform(), for_each(), inner_product(), and
accumulate() that are found in the STL. The interface closely follows that of
the STL. All input is in the form of iterators. The only difference is that the
loop-end iterator is replaced by a count template object. The example shown in
Fig. 3 demonstrates the use of both the STL and FAST versions of transform()
to realize an AXPY-like operation ($y \leftarrow x + y$). The first1 and last1
parameters are iterators for the first input container (indicating the beginning and end
of the container, respectively). The first2 parameter is an iterator indicating
the beginning of the second input container. The result parameter is an
iterator indicating the start of the output container. The binary_op parameter is
a function object that combines the elements from the first and second input
containers into the result container.



// Definition
template <int N> struct vecvec::add {
  template <class Iter1, class Iter2> inline
  vecvec::add(Iter1 x, Iter2 y) {
    typedef typename iterator_traits<Iter1>::value_type T;
    fast::transform(x, cnt<N>(), y, y, plus<T>());
  }
};

// Example use
double x[4], y[4];
fill(x, x+4, 1); fill(y, y+4, 5);
double a = 2;
vecvec::add<4>(scl(x, a), y);      // y[0] += a * x[0]
                                   // y[1] += a * x[1]
                                   // y[2] += a * x[2]
                                   // y[3] += a * x[3]

Fig. 5. Definition and use of BLAIS add()



// General Case
template <int M, int N>
struct mult {
  template <class AColIter, class IterX, class IterY> inline
  mult(AColIter A_2Diter, IterX x, IterY y) {
    vecvec::add<M>(scl((*A_2Diter).begin(), *x), y);
    mult<M, N-1>(++A_2Diter, ++x, y);
  }
};

// N = 0 Case
template <int M>
struct mult<M, 0> {
  template <class AColIter, class IterX, class IterY> inline
  mult(AColIter A_2Diter, IterX x, IterY y) {
    // do nothing
  }
};

Fig. 6. BLAIS matrix-vector multiplication

The difference between the STL and FAST algorithms is that STL
accommodates containers of arbitrary size, with the size being specified at run-time.
FAST also works with containers of arbitrary size, but the size is fixed at compile
time. In Fig. 4 we show how the FAST transform() routine is implemented.
We use a tail-recursive algorithm to achieve complete unrolling; there is no
actual loop in the FAST transform(). The template-recursive calls are inlined,
resulting in a sequence of N copies of the inner loop statement. This technique
(sometimes called template metaprogramming) has been used to a large degree in
the Blitz++ Library.
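
A self-contained version of this technique that compiles with a standard C++ compiler is sketched below. It follows the structure described for the FAST transform(), but it is an illustrative reconstruction rather than the library's source: the base case is written as an overload on cnt<0> and declared first so that the recursive call can find it.

```cpp
#include <functional>

namespace fast {

// Compile-time count that replaces the loop-end iterator.
template <int N> struct cnt {};

// N = 0: ends the template recursion and returns the output position.
template <class InIter1, class InIter2, class OutIter, class BinOp>
inline OutIter transform(InIter1, cnt<0>, InIter2, OutIter result, BinOp) {
    return result;
}

// General case: process one element, then recurse with N - 1.
template <int N, class InIter1, class InIter2, class OutIter, class BinOp>
inline OutIter transform(InIter1 first1, cnt<N>, InIter2 first2,
                         OutIter result, BinOp binary_op) {
    *result = binary_op(*first1, *first2);
    return fast::transform(++first1, cnt<N - 1>(), ++first2, ++result, binary_op);
}

} // namespace fast
```

A call such as fast::transform(x, fast::cnt<4>(), y, y, std::plus<int>()) instantiates four nested calls; once inlined, they amount to four straight-line updates of y with no loop counter or branch.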

BLAIS Vector-Vector Operations Figure 5 gives the implementation for the
BLAIS vector add() routine, and shows an example of its use. The FAST
transform() algorithm is used to carry out the vector-vector addition as it
was in the example above.

The comments in Fig. 5 show the resulting code after the call to add() is
inlined. The scl() function used above demonstrates the purpose of the
scale_iterator. The scale_iterator multiplies the value from x by a when the
iterator is dereferenced within the add() routine. This adds no extra time or
space overhead, due to inlining and lightweight object optimizations. The
scl(x, a) call automatically creates the proper scale_iterator out of x and a.
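
A minimal sketch of such an iterator adaptor is given below. It is not the
MTL class: the namespace blais_sketch, the member layout, and the loop-based
add are assumptions chosen to keep the example short. Only the idea of scaling
on dereference is taken from the text.

```cpp
#include <cassert>
#include <cstddef>

namespace blais_sketch {

// Iterator adaptor that multiplies each element by a fixed factor
// at the moment it is dereferenced.
template <class Iter, class T>
class scale_iterator {
 public:
  scale_iterator(Iter it, T alpha) : it_(it), alpha_(alpha) {}
  T operator*() const { return alpha_ * (*it_); }  // scaling happens here
  scale_iterator& operator++() { ++it_; return *this; }

 private:
  Iter it_;
  T alpha_;
};

// Helper that deduces the iterator type, in the spirit of scl(x, a).
template <class Iter, class T>
inline scale_iterator<Iter, T> scl(Iter it, T alpha) {
  return scale_iterator<Iter, T>(it, alpha);
}

// Plain-loop version of y += x, written against any input iterator.
template <class InIter, class OutIter>
inline void add(InIter x, OutIter y, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i, ++x, ++y) *y += *x;
}

}  // namespace blais_sketch
```

With this adaptor, add(scl(x, a), y, n) computes y[i] += a * x[i] without a
temporary vector, because the multiplication is folded into the dereference.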

BLAIS Matrix-Vector Operations. The BLAIS matrix-vector multiply
implementation is depicted in Fig. 6. The algorithm simply carries out the
vector add operation for the columns of the matrix. Again a fixed-depth
recursive algorithm is used, which becomes inlined by the compiler.

BLAIS Matrix-Matrix Operations. The BLAIS matrix-matrix multiply is
implemented using the BLAIS matrix-vector operation. The code looks very
similar to the matrix-vector multiply, except that there are three integer
template arguments (M, N, and K), and the inner “loop” contains a call to
matvec::mult() instead of vecvec::add().

5.2 A Configurable Recursive Matrix-Matrix Multiply

A high performance matrix-matrix multiply code is highly sensitive to the me- 
mory hierarchy of a machine, from the number of registers to the levels and 
sizes of cache. For highest performance, algorithmic blocking must be done at 
each level of the memory hierarchy. A natural way to formulate this is to write 
the matrix-matrix multiply in a recursive fashion, where each level of recursion 
performs blocking for a particular level of the memory hierarchy. 
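
To make the idea of blocking concrete, the sketch below applies a single
level of blocking to a square matrix product. It is a simplified stand-in,
not the MTL algorithm: it uses run-time loops rather than the recursive
template structure, and the function name, row-major layout, and block-size
parameter bs are assumptions for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C += A * B for n x n row-major matrices, using one level of blocking.
// bs is the block edge length (an assumed tuning parameter).
inline void blocked_matmul(const std::vector<double>& A,
                           const std::vector<double>& B,
                           std::vector<double>& C,
                           std::size_t n, std::size_t bs) {
  assert(bs > 0);
  assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
  for (std::size_t ii = 0; ii < n; ii += bs) {
    const std::size_t i_end = std::min(ii + bs, n);
    for (std::size_t kk = 0; kk < n; kk += bs) {
      const std::size_t k_end = std::min(kk + bs, n);
      for (std::size_t jj = 0; jj < n; jj += bs) {
        const std::size_t j_end = std::min(jj + bs, n);
        // Multiply one block of A by one block of B into a block of C.
        for (std::size_t i = ii; i < i_end; ++i)
          for (std::size_t k = kk; k < k_end; ++k) {
            const double a = A[i * n + k];
            for (std::size_t j = jj; j < j_end; ++j)
              C[i * n + j] += a * B[k * n + j];
          }
      }
    }
  }
}
```

A second level of blocking would replace the innermost three loops with
another call of the same shape using a smaller block size, which is the
structure that the recursive formulation expresses.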

We take this approach in the MTL algorithm. The size and shapes of the
blocks at each level are determined by the blocking adapter. Each adapter
contains the information for the next level of blocking. In this way the
recursive algorithm is determined by a recursive template data structure (set
up at compile time). The setup code for the matrix-matrix multiply is shown in
Fig. 7. This example blocks for just one level of cache, with 64 x 64 blocks.
The small 4 x
template <class MatA, class MatB, class MatC>
void matmat::mult(MatA& A, MatB& B, MatC& C) {
  MatA::RegisterBlock<4,1> A_L0; MatA::Block<64,64> A_L1;
  MatB::RegisterBlock<1,2> B_L0; MatB::Block<64,64> B_L1;
  MatC::CopyBlock<4,2> C_L0;     MatC::Block<64,64> C_L1;

  matmat::mult(block(block(A, A_L0), A_L1),
               block(block(B, B_L0), B_L1),
               block(block(C, C_L0), C_L1));
}



Fig. 7. Setup for the recursive matrix-matrix product 

template <class MatA, class MatB, class MatC>
void matmat::mult(MatA& A, MatB& B, MatC& C) {
  A_k = A.begin_columns(); B_k = B.begin_rows();
  while (not_at(A_k, A.end_columns())) {
    C_i = C.begin_rows(); A_ki = (*A_k).begin();
    while (not_at(C_i, C.end_rows())) {
      B_kj = (*B_k).begin(); C_ij = (*C_i).begin();
      MatA::Block A_block = *A_ki;
      while (not_at(B_kj, (*B_k).end())) {
        mult(A_block, *B_kj, *C_ij);
        ++B_kj; ++C_ij;
      }
      ++C_i; ++A_ki;
    }
    ++A_k; ++B_k;
  }
}



Fig. 8. A recursive matrix-matrix product algorithm 



2 blocks fit into registers. Note that these numbers would normally be constants 
that are set in a header file. 

The recursive algorithm is listed in Fig. 8. The bottom recursion level is
implemented with a separate function that uses the BLAIS matrix-matrix
multiply, and “cleans up” the leftover edge pieces.



5.3 Optimizing Cache Conflict Misses 

Besides blocking, another optimization for matrix-matrix multiply code is block
copying. Typically, utilization of the level-1 cache is much lower than one
might expect due to cache conflict misses. This is especially apparent in
direct-mapped and low-associativity caches. This problem is minimized by
copying the current block of matrix A into a contiguous section of memory,
allowing blocking sizes closer to the size of the level-1 cache without
inducing as many cache conflict misses.

It turns out that this optimization is straightforward to implement in our
recursive matrix-matrix multiply. We already have block objects (submatrices
A_block, *B_j, and *C_j) in Fig. 8. We modify the constructors for these objects
Fig. 9. Performance results for sparse matrix-vector multiply



to make a copy to a contiguous part of memory, and the destructors to copy the
block back to the original matrix. This is especially nice since the
optimization does not clutter the algorithm code; instead, the change is
encapsulated in the copy_block matrix class.
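
The copy-in and copy-back pattern can be sketched with a small class whose
constructor and destructor do the copying. The class name copied_block, the
column-major layout, and the leading-dimension parameter ld are assumptions
for this illustration; the MTL class itself is not shown in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a rows x cols block inside a column-major matrix with
// leading dimension ld. The block is copied into contiguous storage on
// construction and written back to the matrix on destruction.
class copied_block {
 public:
  copied_block(double* origin, std::size_t rows, std::size_t cols,
               std::size_t ld)
      : origin_(origin), rows_(rows), cols_(cols), ld_(ld),
        buf_(rows * cols) {
    assert(ld >= rows);
    for (std::size_t j = 0; j < cols_; ++j)
      for (std::size_t i = 0; i < rows_; ++i)
        buf_[j * rows_ + i] = origin_[j * ld_ + i];
  }

  ~copied_block() {
    // Copy the (possibly modified) block back into the original matrix.
    for (std::size_t j = 0; j < cols_; ++j)
      for (std::size_t i = 0; i < rows_; ++i)
        origin_[j * ld_ + i] = buf_[j * rows_ + i];
  }

  copied_block(const copied_block&) = delete;
  copied_block& operator=(const copied_block&) = delete;

  // Element access into the contiguous copy.
  double& operator()(std::size_t i, std::size_t j) {
    return buf_[j * rows_ + i];
  }

 private:
  double* origin_;
  std::size_t rows_, cols_, ld_;
  std::vector<double> buf_;
};
```

Because the copying lives in the constructor and destructor, the multiply
kernel that operates on the block does not change. A read-only block (such as
a block of A) would not need the copy-back step.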

6 Performance Results 

We present performance results comparing the MTL to several public-domain
and vendor-tuned numerical libraries. Timings were obtained on a Sun
UltraSPARC 170E workstation using KAI C++ (for C++ to C translation) and
the Solaris C compiler with maximum available optimizations. Figure 9 shows
performance results for matrix-vector product computation using an assortment
of sparse matrices from the MatrixMarket. Results are shown for the MTL,
SPARSKIT (Fortran), NIST (C), and TNT (C++), using row-major compressed
storage. Performance results for dense matrix-matrix multiply are shown
in Fig. 10, where we compare the MTL, the Sun Performance Library, TNT, and
the Netlib Fortran BLAS, using column-major storage.

7 Supplemental Libraries 

The MTL provides an extensive foundation for other portable high-performance 
libraries. We have created two: the Iterative Template Library (ITL), and an
implementation of the legacy BLAS. The ITL is a collection of iterative solvers 
Fig. 10. Performance results for dense matrix-matrix multiply



(similar to the Iterative Methods Library) that uses the MTL for its basic
linear algebra operations. Our legacy BLAS implementation is a simple
Fortran-callable interface to the MTL data structures and algorithms. We have
also provided an MTL interface to LAPACK, so that users of the MTL have a
convenient way to access the LAPACK functionality.

8 Conclusion 

Attempts to create portable high performance linear algebra routines have used
specialized code generation scripts to provide enough flexibility in C and
Fortran. We have shown that C++ has enough expressiveness to allow codes to be
reconfigured for particular architectures by merely changing a few constants.
Further, advanced C++ compilers can still aggressively optimize in the presence
of the powerful MTL abstractions, producing code that matches or exceeds the
performance of hand-coded C and vendor-tuned libraries.
Abstract. We present a parallel runtime substrate that supports a global
addressing scheme, object mobility, and automatic message forwarding
required for the implementation of adaptive applications on distributed
memory machines. Our approach is application-driven; the target
applications are characterized by very large variations in time and length
scales. Preliminary performance data from parallel unstructured adaptive
mesh refinement on an SP2 suggest that the flexibility and general
nature of the approach we follow do not cause undue overhead.



1 Introduction 

We present a lean, language-independent runtime system that is easy to port
and maintain, for the efficient implementation of adaptive applications on
large-scale parallel systems. Figure 1 depicts the architecture of the overall
system and its layers that address the different requirements of such an
application.

The first layer, the Data-Movement and Control Substrate (DMCS), provides
thread-safe, one-sided communication. DMCS implements an application
programming interface (API) proposed by the PORTS consortium; it resembles
Nexus and Tulip, two other mid-level communication systems that implement
similar APIs with different design philosophies and objectives. It has been
implemented on IBM’s Low-level Application Programming Interface (LAPI)
on the SP2. For data-movement operations, our measurements on an SP2 show
that the overhead of DMCS is very close to that of LAPI (within 10% for both
puts and gets).

The second layer, the Mobile Object Layer (MOL), will be described in more
detail in Sect. 2. The MOL supports a global addressing scheme designed for
object mobility, and provides a correct and efficient protocol for message
forwarding and communication between migrating objects. We describe a parallel
adaptive mesh generator which uses the MOL as the run-time system for dynamic
load balancing and message passing in Sect. 3. Preliminary performance data
(Sect. 4) suggest that the flexibility and general nature of the MOL’s approach
for data migration do not cause undue overhead. However, the MOL does not
provide support for efficient shared memory management, or implement policies
specifying how and when mobile objects must be moved. The MOL implements some
functionality provided by distributed-shared memory (DSM) systems (Sect. 5).
We conclude with a summary of our current work, and a description of plans to
improve the functionality of the MOL, in Sects. 6 and 7.

2 The Mobile Object Layer 

The Mobile Object Layer provides tools to build migratable, distributed data
structures consisting of mobile objects linked via mobile pointers to these
objects. For example, a distributed graph can be constructed using mobile
objects for nodes and mobile pointers to point to neighboring nodes. If a node
in such a structure is moved from one processor to another, the MOL guarantees
that messages sent to the node will reach it by forwarding them to the node’s
new location. The MOL’s forwarding mechanism assumes no network ordering
and remains correct even if network delays of arbitrary length stall message
reception. Also, forwarding only affects the source and target processors of a
message to a mobile object; this “lazy” updating minimizes the communication
cost of moving an object. The sequence number contained in the Moveinfo
structure passed with a mobile object is compared to the sequence number
contained in the target processor’s local directory. In this way, the MOL
prevents old updates from overwriting newer ones.
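
The sequence-number comparison can be sketched as below. The type and
function names are hypothetical, and wrap-around of the counter is ignored;
the text only states that an incoming sequence number is compared with the
one held in the local directory.

```cpp
#include <cassert>
#include <cstdint>

namespace seq_sketch {

// One directory entry: where the object is believed to be, and how recent
// that belief is.
struct Entry {
  std::uint16_t location;
  std::uint16_t seq;
};

// A location update arriving from another processor.
struct LocationUpdate {
  std::uint16_t location;
  std::uint16_t seq;
};

// Apply an update only if it is newer than what the directory already holds.
// Counter wrap-around is not handled in this sketch.
inline bool apply_update(Entry& e, const LocationUpdate& u) {
  if (u.seq <= e.seq) return false;  // stale or duplicate: keep current entry
  e.location = u.location;
  e.seq = u.seq;
  return true;
}

}  // namespace seq_sketch
```

An update that was delayed in the network and arrives after a newer one is
simply dropped, so message ordering does not have to be guaranteed by the
transport.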

The MOL uses a distributed directory structure which allows fast local access
to the locations of mobile objects. Each processor maintains its own directory;
each directory entry corresponds to the processor’s “best guess” of the
corresponding object’s location. Compared to a central directory, this method
reduces network traffic, but introduces the problem of maintaining global
directory consistency in the presence of object migration. As an alternative to
the expensive, non-scalable method of broadcasting directory updates to all
processors, the MOL implements a “lazy” updating scheme, allowing some
directory entries to be out of date. A processor communicates with a particular
mobile object by sending messages to the “best guess” location given by its
local directory. If this location is incorrect, the sending processor is
informed of the object’s true location; only processors that show an explicit
interest in an object are updated with the object’s correct location.

Although the MOL provides mechanisms to support mobile objects, there 
are no policies specifying how, when, and to where the mobile objects must 
be moved. It is the responsibility of application-specific software to coordinate 
object migration. This allows the MOL to support many systems, since no single 
migration policy could efficiently satisfy the needs of a broad range of irregular, 
parallel applications. The MOL’s flexibility and low-overhead interface make it
an efficient run-time system on which application-specific libraries and languages 
can be built. 

2.1 Mobile Pointers and Distributed Directories 

The basic building block provided by the MOL is the “mobile pointer.” A mobile 
pointer consists of two integer numbers: a 16-bit processor number, which speci- 
fies where the corresponding object was originally allocated (the “home node”), 
and a 32-bit index number which is unique on the object’s home node. This pair 
forms an identifier for a mobile object which is unique on every processor in the 
system and which can be passed as data in messages without extra help from the 
MOL. Using the mobile pointer, the corresponding object’s “best guess” location 
can be retrieved from a processor’s directory. 

A directory is a two dimensional array of directory entries; an entry for any 
mobile pointer can be located by indexing into the directory with the mobile 
pointer’s home node and index number. A mobile object’s directory entry con- 
sists of a 16-bit processor number containing the object’s “best guess” location, 
a 16-bit sequence number indicating how up to date the best guess is, and a 
32-bit physical pointer to the object’s data. The pointer can be retrieved using 
mob_deref(), which returns NULL if the object is not physically located on the 
requesting processor. 

There are three possibilities for the state of an object’s location as shown by 
the object’s directory entry. First, the object may reside on the current proces- 
sor, in which case the message can be handled locally. Second, the object may 
reside at a remote processor; in this case, the message is sent to the processor 
indicated by the directory entry. If the target processor does not contain the ob- 
ject, it will forward the message to the best guess location given by the object’s 
local directory entry. Third, the directory may not have an entry for the mobile 
pointer. In this case, the mobile pointer’s home node is used as the default best 
guess location for the object. 
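
The data layout and the three lookup cases described in this section can be
sketched as follows. All type and function names are hypothetical, and a
std::map replaces the two-dimensional array to keep the example short; the
field widths follow the text, except that the data pointer uses the native
pointer size.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

namespace mol_sketch {

struct MobilePtr {
  std::uint16_t home;   // processor where the object was first allocated
  std::uint32_t index;  // unique on the home node
};

struct DirEntry {
  std::uint16_t best_guess;  // processor believed to hold the object
  std::uint16_t seq;         // how up to date the guess is
  void* local;               // object data, or nullptr if not resident here
};

enum class Route { Local, Remote };

struct Decision {
  Route route;
  std::uint16_t target;  // processor that should receive the message
};

class Directory {
 public:
  explicit Directory(std::uint16_t self) : self_(self) {}

  void set(MobilePtr mp, DirEntry e) { entries_[key(mp)] = e; }

  Decision route(MobilePtr mp) const {
    auto it = entries_.find(key(mp));
    // Case 3: no entry, so fall back to the home node as the best guess.
    if (it == entries_.end()) return {Route::Remote, mp.home};
    // Case 1: the object is resident on this processor.
    if (it->second.local != nullptr) return {Route::Local, self_};
    // Case 2: send to the recorded best-guess location.
    return {Route::Remote, it->second.best_guess};
  }

 private:
  static std::pair<std::uint16_t, std::uint32_t> key(MobilePtr mp) {
    return {mp.home, mp.index};
  }
  std::uint16_t self_;
  std::map<std::pair<std::uint16_t, std::uint32_t>, DirEntry> entries_;
};

}  // namespace mol_sketch
```

In case 2 the receiving processor may itself no longer hold the object, in
which case it repeats the same lookup with its own directory and forwards the
message.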

2.2 Message Layer 

Our implementation of the MOL builds its own message layer on top of the Data
Movement and Control Substrate (DMCS), which is in turn built on top of IBM’s
Low-level API (LAPI). In order to make the MOL as flexible and as portable
as possible, versions also exist which use Active Messages (AM) or NEXUS
as the underlying transport mechanism. In our implementation, we use incoming
and outgoing pools of messages, similar to those used in the Generic Active
Message specification from Berkeley. The MOL supports “processor requests”
with mob_request() and “object messages” with mob_message(), to transfer
messages smaller than 1024 bytes to processors and mobile objects,
respectively. 1024 bytes was found, empirically and analytically, to be the
maximum message size for which store-and-forwarding is more efficient than
three-way rendezvous for forwarded messages on the SP2. Both types require a
user-supplied procedure to handle the message when it arrives at its
destination.

Three types of handlers are available to process a message initiated by the 
MOL. The first type is a function handler, which is similar to an AM handler. 
This is the fastest handler type, since it is executed immediately upon being 
received and processed by the MOL, but neither communication nor context 
switching is allowed within the handler. Second, the message may be processed 
from within a delayed handler, which is queued internally by the MOL. The 
delayed handler is slower but also more flexible, in that communication, but not 
context switching, is allowed from inside the handler. Third, a threaded handler 
can spawn a thread to process the message. This is the most flexible handler type, 
but it is also the slowest. Since each of these handlers may be appropriate in 
different situations, the MOL supports all three; the type of handler is specified 
as an argument to mob_request() and mob_message(). All handlers are passed
the sending processor number, the physical address and length of the message 
data, and one user-defined argument. In addition, object message handlers are 
passed the mobile pointer and the local address of the mobile object. 
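
The difference between immediate and queued handling can be sketched as
below. This is not the MOL interface: the Dispatcher class, the Message
layout, and the poll() function are assumptions, and the threaded handler
type is left out to keep the example free of threading dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

namespace handler_sketch {

enum class HandlerType { Function, Delayed };

// Arguments every handler receives, following the description above.
struct Message {
  int sender;          // sending processor number
  const void* data;    // address of the message data
  std::size_t length;  // length of the message data in bytes
  void* user_arg;      // one user-defined argument
};

using Handler = std::function<void(const Message&)>;

class Dispatcher {
 public:
  // Function handlers run at once; delayed handlers are queued for later.
  // The message data must stay valid until a queued handler has run.
  void deliver(HandlerType type, Handler h, const Message& m) {
    assert(static_cast<bool>(h));
    if (type == HandlerType::Function) {
      h(m);
    } else {
      pending_.emplace_back(std::move(h), m);
    }
  }

  // Run all queued delayed handlers; returns how many were executed.
  std::size_t poll() {
    std::size_t ran = 0;
    while (!pending_.empty()) {
      std::pair<Handler, Message> job = std::move(pending_.front());
      pending_.pop_front();
      job.first(job.second);
      ++ran;
    }
    return ran;
  }

 private:
  std::deque<std::pair<Handler, Message>> pending_;
};

}  // namespace handler_sketch
```

Running a delayed handler from poll() rather than from the receive path is
what makes it safe for that handler to start further communication.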

Although the MOL only directly supports small and medium sized messages
(store and forwarding of large messages is inefficient), efficient large message
protocols can be built using MOL messaging. As a simple example, suppose the
user wishes to create a large “get” operation directed to a mobile object. This 
can be done simply by creating a local buffer to hold incoming data, and then 
sending an object message including the buffer’s address to the target object. 
The delayed or threaded remote handler can then call a store procedure like AM’s 
am_store() to save the requested data to the buffer in the originating process. 

Large message send/receive protocols, typically accomplished with a three 
way rendezvous, can also be implemented with the MOL. As in the first exam- 
ple, an object message is sent to the target object containing the amount of space 
to allocate. The remote handler then allocates the buffer and sends a request to 
the originating processor with the buffer’s address. Finally, the source processor 
transfers the data, using a store procedure such as am_store(). In this example, 
the object must be “locked” by the programmer to keep the object from
moving before the store operation completes, since the MOL does not control the
migration of objects. 
3 Application: Parallel Grid Generation

The efficient implementation of an unstructured mesh generator on distributed
memory multiprocessors requires the maintenance of complex and dynamic
distributed data structures for tolerating latency, minimizing communication
and balancing processor workload. We describe an implementation of the parallel
2D Constrained Delaunay Triangulation (CDT) which uses the Mobile Object
Layer to simplify data migration for load balancing the computation. We have
chosen a simple work-stealing method to demonstrate the effectiveness of
the MOL as the bookkeeper for message-passing when data is migrated by the
load balancing module.

Constrained Delaunay Triangulation. The mesh generator uses a Constrained
Delaunay Triangulation method to generate a guaranteed-quality mesh.
Given a precomputed domain decomposition, each subdomain is refined
independently of the other regions, except at the interfaces between regions. For 2D
meshes, the extent of the refinement is defined by “constrained” interface and 
boundary edges. If a boundary or interface edge is part of a triangle to be refined, 
that edge is split. Since interface edges are shared between regions, splitting an 
edge in one region causes the change to propagate to the region which shares 
the split edge. The target region is updated as if it had split the edge itself. 

Load Balancing with the MOL. The input to the mesh generator is a decompo- 
sition of a domain into some number of regions, which are assigned to processors 
in a way that maximizes data locality. Each processor is responsible for ma- 
naging multiple regions, since, in general, there will be an over-decomposition 
of the domain. Subsequently, imbalance can arise due to both unequal distri- 
bution of regions and large differences in computation (e.g. between high- and 
low-accuracy regions in the solution). 

The work-stealing load balancing method we implement maintains a counter
of the number of work-units that are currently waiting to be processed, and
consults a threshold of work to determine when work should be requested from
other processors. When the number of work-units falls below the threshold,
a processor requests a sufficient amount of work to maintain consistent resource
utilization.
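
The threshold test can be written as a one-line policy. The function name and
the target parameter (how much queued work a processor tries to restore) are
assumptions for this sketch; the text does not say how the requested amount
is computed.

```cpp
#include <cassert>
#include <cstddef>

// Returns how many work-units to request from other processors.
// `threshold` triggers a request; `target` is the assumed level of queued
// work that a request tries to restore.
inline std::size_t work_to_request(std::size_t queued, std::size_t threshold,
                                   std::size_t target) {
  assert(target >= threshold);
  if (queued >= threshold) return 0;  // enough local work, no request
  return target - queued;             // ask for enough to reach the target
}
```

For example, with a threshold of 3 and a target of 8, a processor holding 2
waiting work-units would request 6 more, while one holding 3 or more would
request none.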

The regions can be viewed as the work-units or objects which the load ba- 
lancer can migrate to rebalance the computation. Using the MOL, each region is 
viewed as a mobile object; we associate a mobile pointer with each region, which 
allows messages sent to migrated regions to be forwarded to the new locations. 
The load balancing module can therefore migrate data without disrupting the 
message-passing in the computation. 

Data Movement Using the MOL. The critical steps in the load balancing phase
are the region migration, and the updates for the edge split messages between
regions. To move a region, the MOL requires that mob_uninstallObj be called
to update the sending processor’s local directory to reflect the pending change
in the region’s location. Next, a programmer-supplied procedure is used to pack
the region’s data into a buffer, which must also contain the region’s mobile
pointer and the 4-byte Moveinfo structure returned by mob_uninstallObj to track
the region’s migration. Then, a message-passing primitive (e.g. MPI_SEND) is
invoked to transport the buffer, and another user-supplied procedure unpacks
and rebuilds the region in the new processor. After the region has been unpacked,
mob_installObj must be called with the region’s mobile pointer and the Moveinfo
structure to update the new processor’s directory.
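
The migration steps above can be traced in a small single-process simulation.
Nothing here is the MOL API: uninstall and unpack_and_install are stand-ins
for mob_uninstallObj and mob_installObj, the layout of MoveInfo is assumed
(the text gives only its size and that it carries a sequence number), and the
send step is reduced to handing over a buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

namespace migrate_sketch {

struct MobilePtr { std::uint16_t home; std::uint32_t index; };
struct MoveInfo { std::uint16_t seq; std::uint16_t from; };  // assumed layout
struct Entry { std::uint16_t where; std::uint16_t seq; bool resident; };

struct Proc {
  std::uint16_t id;
  std::map<std::uint64_t, Entry> dir;                    // local directory
  std::map<std::uint64_t, std::vector<double>> regions;  // resident data
};

inline std::uint64_t key(MobilePtr p) {
  return (static_cast<std::uint64_t>(p.home) << 32) | p.index;
}

// Step 1: record in the sender's directory that the region is leaving.
inline MoveInfo uninstall(Proc& src, MobilePtr p, std::uint16_t dest) {
  Entry& e = src.dir[key(p)];
  e.where = dest;
  e.resident = false;
  ++e.seq;
  return MoveInfo{e.seq, src.id};
}

// Step 2: pack the mobile pointer, MoveInfo, and region data into a buffer.
inline std::vector<unsigned char> pack(MobilePtr p, MoveInfo mi,
                                       const std::vector<double>& data) {
  std::vector<unsigned char> buf(sizeof p + sizeof mi +
                                 data.size() * sizeof(double));
  unsigned char* out = buf.data();
  std::memcpy(out, &p, sizeof p);   out += sizeof p;
  std::memcpy(out, &mi, sizeof mi); out += sizeof mi;
  if (!data.empty()) std::memcpy(out, data.data(), data.size() * sizeof(double));
  return buf;
}

// Steps 4 and 5: unpack on the destination and update its directory.
inline MobilePtr unpack_and_install(Proc& dst,
                                    const std::vector<unsigned char>& buf) {
  MobilePtr p;
  MoveInfo mi;
  assert(buf.size() >= sizeof p + sizeof mi);
  const unsigned char* in = buf.data();
  std::memcpy(&p, in, sizeof p);   in += sizeof p;
  std::memcpy(&mi, in, sizeof mi); in += sizeof mi;
  const std::size_t n = (buf.size() - sizeof p - sizeof mi) / sizeof(double);
  std::vector<double> data(n);
  if (n > 0) std::memcpy(data.data(), in, n * sizeof(double));
  dst.regions[key(p)] = data;
  dst.dir[key(p)] = Entry{dst.id, mi.seq, true};
  return p;
}

// Whole sequence; step 3 (the send) is just passing the buffer along.
inline void migrate(Proc& src, Proc& dst, MobilePtr p) {
  const MoveInfo mi = uninstall(src, p, dst.id);
  const std::vector<unsigned char> buf = pack(p, mi, src.regions[key(p)]);
  src.regions.erase(key(p));
  unpack_and_install(dst, buf);
}

}  // namespace migrate_sketch
```

After migrate() returns, the source directory points at the destination and
the destination directory marks the region as resident with the sequence
number carried in the buffer.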

Because regions can be migrated, standard message-passing primitives such as
MPI_SEND cannot be used to send split edge messages from one region to
another. The MOL will forward a split edge message sent with mob_message, and
will update the sending processor’s directory so that, unless the target
region moves again, subsequent messages will not be forwarded.

4 Preliminary Performance Data 

We present results for mob_message, which allows messages to be sent to a mobile
object via a mobile pointer, and for mob_request, which directs messages of 1024
bytes or less (a parameterized value) to specific processors without explicitly 
requesting storage space on the target processor. In addition, we present data 
gathered from the parallel meshing application for both non-load balanced and 
load balanced runs at different percentages of imbalance in the computation. 

All measurements for mob_message and mob_request were taken on an IBM
RISC System/6000 SP, using Active Messages. The benchmarks measured
the per-hop latency of messages ranging from 8 to 1024 bytes, as compared to
the equivalent am_store calls. The performance is very reasonable; the latency
of mob_request is within about 11% of the latency of am_store, while the
latency of mob_message is about 12% to 14% higher than that of am_store.

To illustrate the importance of the MOL’s updates, Fig. 2 shows the latency of
messages that were forwarded once each time they were sent versus messages that
were not forwarded. Not surprisingly, the latency of the forwarded messages was
about twice as high as that of the unforwarded messages. In a real application,
the overall (amortized) cost of forwarding is determined by how often an object
moves versus how often messages are sent to the object, since messages are
forwarded immediately after an object moves but not after the updates have
been received. In the case of the mesh generator, a large number of split edge
requests are sent, relative to the number of times a mesh region is migrated
(see Fig. 5), resulting in a low amortized cost for forwarding split-edge
requests. Figure 3 shows the performance of the MOL’s three types of handlers.
The graph clearly shows that the overheads caused by the delayed and threaded
handlers are fairly low relative to the functionality they add.
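
The amortized forwarding cost discussed above can be made concrete with a
rough model. It is an assumption for illustration, not a fit to the
measurements: suppose a message that is not forwarded costs L, a message
forwarded once costs about 2L (as reported above), and f of the m messages
sent to an object are forwarded. The average latency per message is then

```latex
\bar{L} \;=\; \frac{(m - f)\,L + f \cdot 2L}{m} \;=\; L\left(1 + \frac{f}{m}\right)
```

so the relative overhead is f/m. It approaches zero when messages far
outnumber migrations, which is the situation described for the split edge
requests in the mesh generator.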

The next set of graphs represents data for a parallel mesh with between
100,000 and 170,000 elements, and for load imbalances of between 8 and 50
percent. Each of the four processors in the system started with 16 regions. All
Fig. 2. Mobile object layer performance: forwarding overhead



measurements were taken on an SP2, using a NEXUS implementation of the 
MOL. 

Figure 4 represents the minimum and maximum computation times on the
four processors in the non-load balanced experiments. Given above each bar
is the number of elements generated in the mesh for that particular run.
Figure 5 displays the maximum computation time for a series of load-balanced
mesh computations which used the same initial mesh as the non-load balanced
experiments. Each bar is broken down into the time spent triangulating regions,
packing and servicing split edge requests, and forwarding messages to migrated
regions, in order to show the minimal overhead of using the MOL’s forwarding
mechanism. The tuple above each bar gives the number of split edge requests
and the number of object migrations.

5 Related Work 

Our run-time system provides a global address space and supports object
mobility, as do many other previously developed systems and high performance
languages designed for irregular applications. Examples of such systems and
languages are ABC++, TreadMarks, Charm++, Chaos++, CC++, Amber, and CRL, to
mention a few.

Chaos++ supports globally addressable objects, an abstraction similar to
mobile objects. In Chaos++, global objects are owned by a single processor and
all other processors with data-dependencies to a global object possess shadow
copies of the global object. The Mobile Object Layer does not use shadow objects;
instead, it relies on an efficient message-forwarding mechanism to locate and
fetch data from remote objects.






N. Chrisochoides et al. 




Fig. 3. Mobile object layer performance: handler overhead 



ABC++ [14] proposed an object migration mechanism that would allow an
object to move away from its original “home node.” However, the proposed 
mechanism would have required communication with the home node each time 
a message is sent to the object. The MOL eliminates additional communication 
for every message, because its directories are automatically updated to keep track 
of where objects have migrated. Furthermore, MOL updates are not broadcast 
to all processors in the system, but are lazily sent out to individual processors 
as needed. The MOL protocol for dealing with updates correctly and efficiently 
is nontrivial, and goes beyond the proposals presented for ABC++. 

FLASH integrates both message passing and shared memory into a single
architecture. The key feature of the FLASH architecture is the MAGIC pro- 
grammable node controller which connects processor, memory, and network 
components at each node. MAGIC is an embedded processor which can be pro- 
grammed to implement both cache coherence and message passing protocols. 
The Mobile Object Layer, on the other hand, is designed to be a thin software 
layer which can exist independently from the underlying hardware. The MOL is 
isolated from the system hardware by the DMCS layer, which allows the MOL 
to be portable as well as tunable; the MOL can be tuned to extract maximal 
performance by providing a vendor-specific DMCS implementation. 

The MOL offers a substantial improvement over explicit message passing
systems, such as systems built using MPI. The primary functionality added by
the MOL is the ability to send messages to mobile objects, not processors. In
other words, with the MOL, it is possible to communicate with objects without
knowledge of the object’s location. This greatly eases the burden placed on
the developers of mobile, adaptive applications. The MOL hides the complexity
involved with maintaining the validity of global pointers by employing a
separation of concerns philosophy; the DMCS layer maintains the correct message
ordering, while the MOL maintains the causality of messages. User code would
be responsible for the migration of objects and for the maintenance of the
validity of references to those objects, if it relied solely upon a message
passing system, especially with one-sided communication protocols.

At the other extreme of the continuum are page-based software DSM systems, 
in which a specific range of virtual memory is actually shared among all nodes in 
the parallel system. The MOL differs from these systems, in that the MOL needs 
no hardware, operating system, or compiler support. The MOL is implemented 
as a library that is linked with user code to provide some of the functionality of 
a DSM system. However, because it is a library, no complex interactions with 
low-level software or hardware are necessary; there is no interaction with the 
virtual memory system, the operating system, or physical memory busses. The 
trade-off for this simplicity is extra complexity of the programming model. Reads 
and writes to mobile objects do not utilize the same mechanism as do reads and 
writes to shared objects. 



6 Summary and Conclusions 



We have presented a run-time substrate to support the efficient data migration
required for the parallelization of adaptive applications. The runtime substrate
automatically maintains the validity of global pointers as data migrates from one
processor to another, and implements a correct and efficient message forwarding
and communication mechanism between the migrating objects. The flexibility
of our approach combined with the low overhead makes the Mobile Object Layer
attractive for application developers as well as compiler writers.

The MOL is specifically designed to reduce the amount of effort needed to 
efficiently and easily implement mobile, adaptive applications, and related sup- 
port libraries. The most obvious benefit is that the programmer need not worry 
about the details of correct message passing in the presence of data migration, 
since it is handled within the MOL. Otherwise, several thousand lines of code
would be written to perform similar functions to the MOL, and possibly less
efficiently. Hence, more effort can be devoted to developing the application,
instead of lower-level message passing and updating primitives.

The MOL is also lightweight, in that its latency is very close to that of the 
message layer it is built upon, even for forwarded messages. Thus, little is lost in 
the efficiency of an application relying upon the MOL to effect object migration. 
This is accomplished by only doing a minimal amount of extra computation 
to maintain the distributed directories and the incoming and outgoing message 
pools. However, the results of this minimalism show up in the MOL interface, 
which has a total of just six procedures. The current interface must be expanded 
and enhanced; a number of additions can be made to increase the capabilities of 
the MOL. 
7 Future Work

To further improve the MOL’s flexibility and efficiency, coherency protocols will 
be implemented as “plug and play” modules to maximize performance for a gi- 
ven application. Different applications benefit from different coherency models, 
so restricting the programmer to only a single coherency model unnecessarily 
degrades performance. By allowing user code to experiment with different pro- 
tocols, the MOL can be tweaked to obtain the highest level of performance on 
an application by application basis. 

All of the coherency protocols in use today fall into three categories: “lazy” 
updating, “eager” invalidation, and “eager” updating protocols. Each of these 
protocols results in a certain amount of network traffic under certain circum- 
stances, and therefore the best choice for which coherency protocol to use can 
vary from one application to another. For example, in situations where nodes 
communicate infrequently with migrated mobile objects, a lazy updating pro- 
tocol, such as the one that is currently implemented in the MOL, works better 
relative to an eager update or invalidation protocol. A lazy protocol avoids unne- 
cessary broadcasting of update or invalidation messages to nodes which no longer 
have an interest in a mobile object. 

However, if an eager protocol has hardware support, a software-based lazy 
protocol may not be the best choice. For example, in the FLASH system,
the network controllers are programmed to implement eager protocols. In the 
explicit updating protocol used by FLASH, all nodes that hold pointers to a 
shared memory region are automatically updated if the memory is invalidated. 

Finally, we will also change the dense, two-dimensional array directory struc- 
ture into a sparse data structure, such as a hashtable. A matrix data structure 
is non-scalable with respect to memory usage, although it supports very fast 
retrieval of mobile objects. Careful implementation of the sparse data structure 
will provide access speeds comparable to that of the matrix data structure. 
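
The memory trade-off between the two directory layouts can be illustrated with
a small Python sketch. This is not MOL code: the class names, the choice of
(home node, object index) as the key, and the sizes are assumptions made purely
for illustration.

```python
# Hypothetical sketch; DenseDirectory and SparseDirectory are illustrative
# names and are not part of the MOL interface.

class DenseDirectory:
    """Directory stored as a dense (home node x object index) table."""

    def __init__(self, num_nodes, max_objects_per_node):
        # Every possible slot is allocated up front, whether used or not.
        self.table = [[None] * max_objects_per_node for _ in range(num_nodes)]

    def set_location(self, home, index, node):
        self.table[home][index] = node

    def lookup(self, home, index):
        return self.table[home][index]

    def slots(self):
        return sum(len(row) for row in self.table)


class SparseDirectory:
    """Directory stored as a hash table keyed by (home node, object index)."""

    def __init__(self):
        self.table = {}

    def set_location(self, home, index, node):
        self.table[(home, index)] = node

    def lookup(self, home, index):
        # Unknown objects report None, like an empty dense slot.
        return self.table.get((home, index))

    def slots(self):
        return len(self.table)


dense = DenseDirectory(num_nodes=64, max_objects_per_node=1024)
sparse = SparseDirectory()
for directory in (dense, sparse):
    directory.set_location(home=3, index=17, node=42)  # object now on node 42

print(dense.lookup(3, 17), sparse.lookup(3, 17))  # 42 42
print(dense.slots(), sparse.slots())              # 65536 1
```

Both layouts answer a lookup in expected constant time, but the dense table
pays for every possible slot while the hash table pays only for the objects
that are actually registered.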

By affording programmers the ability to choose the coherency protocol, the 
MOL can be tailored to work more cohesively with the application. Support 
for multiple message sizes will also further this goal, and, along with replacing 
the dense directory structure with a sparse structure, will reduce the memory 
overhead imposed upon applications which utilize the MOL for message passing 
and object mobility.

Abstract. A flexible software package for data partitioning has been
developed. The package handles irregularly weighted structured grids and
irregularly coupled structured multiblock grids. Unstructured partitioning
can also be addressed with the software tools. The software gives
support for construction of different partitioning algorithms by compo- 
sition of low-level operations. Automatic partitioning methods are also 
included. The implementation is in Fortran 90 with an object-oriented 
design. The use of the package has been demonstrated by partitioning 
a grid for an oceanographic model and a multiblock grid modeling an 
expanding and contracting tube for airflow computations. 



1 Introduction 

Structured grids are commonly used in scientific computing, e.g. in airflow
simulations, ocean modeling and electromagnetic computations. For complicated
geometries a set of block-structured grids, i.e. a composite grid, is needed. An
alternative is to use unstructured grids. However, structured grids require less
memory, and efficient solution techniques can be implemented more easily.
For unstructured grid problems there are general partitioning methods that work
well for many applications, and a number of general software packages are
available for these methods, for example Top/Domdec [3], Chaco, Metis, Jostle,
and Scotch. For structured grid problems we also have the constraint that
the partitioning methods should yield structured partitions in order to preserve
the efficiency of the solvers. There is then a trade-off between structure and
load balance that depends very much on the application. Consequently, the
software for partitioning is usually integrated in the parallel solver environment
and cannot easily be extracted and adapted to other problems or applications.

Partitioning a single grid with a homogeneous workload is straightforward.
The grid can simply be divided into equally sized rectangular blocks, one block
per processor. This can easily be handled with a data parallel compiler, for
example High Performance Fortran (HPF). The problem arises when the workload
becomes irregular or we have irregular data dependencies. For a single
structured grid with an irregular workload in the domain there exist a number of
block partitioning algorithms, e.g. the Recursive Coordinate Bisection method,
which is perhaps the best known. However, these standard methods do
not always give satisfactory results, as discussed in Sect. 3.2, and we have
developed a new approach to partition data for this class of problems. The idea
is to compose an algorithm from a set of low-level operations. This is the key
issue for the software package discussed below. For composite grid problems, i.e.
irregularly coupled regular grids, the subgrids are usually either partitioned
over all processors one by one or distributed as they are to different
processors. The success of these two approaches is limited and they work well
only for certain applications. Here, too, we have developed a new partitioning
strategy. Our approach is very flexible and gives good results for different
kinds of composite grids. The ideas originate from our partitioning approach for
single structured grids.

These two classes of problems, single structured grids with irregular workload
and composite structured grids, are closely related, and it appears that they
can be solved with related algorithms and software. The partitioning software
we describe in this paper addresses this issue and supports previous
strategies as well as the new algorithms proposed by us. The emphasis is on
giving an overview of the software and its capabilities. The strategies and
algorithms are described more thoroughly elsewhere. The design of the package is
very important for the flexibility. The software is written in Fortran 90 with
an object-oriented design and is an independent package that can easily be used
in various applications. The object-oriented design provides a new way to
construct partitioning algorithms by composition of low-level operations. The
result is a software package that is flexible enough to address a large variety
of applications. The user can choose a suitable algorithm or even compose a new
one from the low-level building blocks.

The rest of the paper has the following outline. The software package and
the ideas are presented in Sect. 2. Then in Sect. 3 we briefly describe the kind
of partitioning strategies that can be addressed with our software. In Sect. 4 we
discuss some applications and finally in Sect. 5 we summarize our contributions.



2 The Software 

The partitioning software is implemented in Fortran 90. It has an object-oriented 
design. This means that we have created abstract data types and encapsulated 
both the data and functionality into one entity, a class. The data can only be 
affected through the predefined operations within the class. This yields a disci- 
plined way to program and use the software tools. But, it also gives a natural and 
flexible way to create partitioning algorithms by composition of the predefined 
operations. Even though Fortran 90 is not a fully object-oriented language, it
gives good support for object-oriented programming with the type, module and
interface concepts. Inheritance is not directly supported, so it can be argued
that the Fortran 90 implementation remains object-based.

We have three classes that address different kinds of problems: (i) graph for
graph partitioning problems or unstructured grids, (ii) domain decomposition
(dd) for structured irregular workload problems, and (iii) composite domain
decomposition (cdd) for irregularly coupled structured grids. In addition, we
have a corresponding set of derived composite data types that facilitate
communication with the classes and give cleaner interfaces. The user only has to
understand how these derived data types are constructed to adapt the
partitioning software to his or her application. The software infrastructure is
illustrated in Fig. 1.







Fig. 1. System infrastructure. We have a module (dashed frame) for block-structured
partitioning, part_mod, and a module for graph partitioning, graph_mod. The
separate modules communicate through simple communication objects, cgdata,
griddata, laplmatrix, weights, and part_array, which are also encapsulated in
their own module.



The graph class contains operations for partitioning a graph. Here, we have 
implemented the Recursive Spectral Bisection method, a Greedy-like partitio- 
ning method, and a local refinement method as operations in the class. A graph 
is created from a Laplacian matrix and the vertex weights. As a result we can ac- 
cess the partitioning of the vertices. This is a separate and independent module 
that can be used directly in an application. 

The class domain decomposition contains a set of low-level operations to 
partition and decompose an irregularly weighted domain into a set of structured 
data blocks. The operations in this class can be divided into three categories: 

Splitting: Here we have implemented methods to split the domain into either
equally sized parts or equally weighted parts. We also have an operation that
recursively splits blocks depending on their relative sizes or weights. In
addition, we allow the user to define her own splitting of the domain, so that
she can easily incorporate her favorite partitioning algorithm into our
framework.

Mapping: This category includes operations to map the data to the processors.
We have implemented various partitioning methods that distribute a
connected set of data, in this case the blocks, to the processors. So far only
Cyclic distribution, Recursive Spectral Bisection, Recursive Coordinate
Bisection, and a local refinement method are included, but the framework can
easily be extended with additional algorithms.

Assembly: Finally, we have operations to assemble the data or the blocks within
the partitions. If two blocks within the same partition can be merged into a
larger rectangular block, this is advantageous. We also have an
operation to shrink blocks by moving the boundaries inwards if possible (see
the discussion in Sect. 3.2).

The principle for partitioning a grid is to first split the domain into at least
as many blocks as there are processors, map the resulting data blocks to the
processors, and finally assemble the data within the partitions to reduce the
number of data items and connections. This gives the user considerable latitude
to compose a suitable partitioning algorithm by choosing and combining
operations from the three categories above. Note that the order of the
operations is not fixed. The operations can logically be applied in any order,
which further increases the flexibility. For example, after splitting and
mapping the data we can again split the data within the partitions and then
re-map or refine the corresponding partitions in a multi-level fashion.
Moreover, there is also a set of access functions. We can extract a graph with
the data blocks as vertices, partition the graph within the graph class, and
impose the resulting partitioning back onto the block-structured
decomposition of the domain.
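
The split, map, and assemble composition can be sketched in a few lines. The
sketch below is in Python rather than Fortran 90, and every name in it
(split_rows, map_contiguous, assemble) is hypothetical and not taken from the
package. The contiguous mapping is a deliberately simple stand-in for the
mapping methods listed above.

```python
# Illustrative sketch only; function names are assumptions, not the package API.
# A block is a half-open index box (r0, r1, c0, c1).

def split_rows(nrows, ncols, nblocks):
    """Splitting: cut the domain into nblocks horizontal strips."""
    bounds = [i * nrows // nblocks for i in range(nblocks + 1)]
    return [(bounds[i], bounds[i + 1], 0, ncols) for i in range(nblocks)]

def block_weight(w, block):
    """Sum of the point weights inside a block."""
    r0, r1, c0, c1 = block
    return sum(w[r][c] for r in range(r0, r1) for c in range(c0, c1))

def map_contiguous(blocks, w, nproc):
    """Mapping: hand consecutive blocks to a processor until the accumulated
    weight reaches that processor's share of the total, then move on."""
    target = sum(block_weight(w, b) for b in blocks) / nproc
    owner, part, acc = [], 0, 0
    for b in blocks:
        owner.append(part)
        acc += block_weight(w, b)
        if part < nproc - 1 and acc >= target * (part + 1):
            part += 1
    return owner

def assemble(blocks, owner):
    """Assembly: merge vertically adjacent strips that share an owner."""
    merged = []
    for b, p in zip(blocks, owner):
        if merged and merged[-1][1] == p and merged[-1][0][1] == b[0]:
            r0, _, c0, c1 = merged[-1][0]
            merged[-1] = ((r0, b[1], c0, c1), p)
        else:
            merged.append((b, p))
    return merged

# 8 x 4 grid: the first two rows carry three times the work of the others.
w = [[3] * 4 for _ in range(2)] + [[1] * 4 for _ in range(6)]
blocks = split_rows(8, 4, 8)             # more blocks than processors
owner = map_contiguous(blocks, w, 2)     # two processors
partition = assemble(blocks, owner)
print(owner)      # [0, 0, 1, 1, 1, 1, 1, 1]
print(partition)  # [((0, 2, 0, 4), 0), ((2, 8, 0, 4), 1)]
```

Eight strips are reduced to two rectangular blocks of equal weight, one per
processor, which is the effect the assembly step is meant to achieve.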

The class composite domain decomposition is an aggregate of several instances
of the class dd, one for each subgrid. This means that we can access an
individual subgrid and partition it using the methods in the class dd, i.e.
partition the subgrids one by one using the single grid strategies. The cdd
class also has an overview of the whole composite grid and handles the
connections between the subgrids. For example, we have an algorithm to cluster
the processors, one processor cluster for each subgrid. We can then partition
the subgrids in the different clusters independently using any single grid
partitioning strategy. The idea is similar to the concept of processor subsets
in HPF-2, but here we have developed a clustering algorithm that is optimal with
respect to the load balance between the clusters. Moreover, we can create a
connectivity graph with data items from all subgrids (considering the inter-grid
connections). The data items or vertices in the graph can be the subgrids, the
data blocks from splitting the grids, or even the individual grid points in the
subgrids. This gives us a great deal of flexibility, and we can compose
partitioning algorithms ranging from structured coarse-grain parallelism down
to unstructured fine-grain parallelism.

For evaluation and post-processing purposes, the three classes generate
statistics of the corresponding partitionings. These can also be used to create
automatic and adaptive partitioning algorithms. Furthermore, data files are
produced for visualization in Matlab. (The Matlab post-processing scripts are
included in the software package.)

In summary, the dd class is the key component in the system. It includes 
low-level operations to create block-structured partitions. A composite domain 
can be represented and partitioned using a set of dd-objects. The cdd class has an 
overview of the composite grid and includes in addition some global partitioning 
operations. The graph class is an auxiliary class for the block-structured
partitioning applications. The separation makes it easy to incorporate other graph
partitioning methods or to connect other graph partitioning packages to our 
system, such as Chaco or Metis. Only the code in the graph-module has to be 
adapted. 

3 Partitioning Strategies and Methods 

This section is a short overview of partitioning algorithms for the kind of
problems we address with our software. The section also serves to give some
additional indication of how our tools are intended to be used. We have three
subsections, but the emphasis is on the latter two, which cover the
block-structured applications.



3.1 Graph Partitioning 

In graph partitioning a weighted graph is constructed with the data items as 
vertices. The edges in the graph correspond to the neighbor relations between 
the data items. The graph can be represented numerically with a Laplacian 
matrix. The graph is then split in subgraphs with a partitioning method giving 
the corresponding data partitioning. The goal is to get compact subgraphs with 
a minimal number of edge-cuts. 
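
As a concrete illustration, the following Python sketch builds the Laplacian
matrix of a small graph and counts the edge-cuts of two candidate partitionings.
It assumes unit edge weights and the common definition L = D - A (degree matrix
minus adjacency matrix); the function names are illustrative and do not come
from the package.

```python
# Illustrative sketch; unit edge weights and L = D - A are assumptions.

def laplacian(n, edges):
    """Return the n x n graph Laplacian as nested lists."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] -= 1
        L[j][i] -= 1
        L[i][i] += 1
        L[j][j] += 1
    return L

def edge_cut(L, part):
    """Count edges whose end points lie in different partitions."""
    n = len(L)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if L[i][j] != 0 and part[i] != part[j])

# 2 x 3 grid of data items, numbered row by row:
#   0 - 1 - 2
#   |   |   |
#   3 - 4 - 5
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
L = laplacian(6, edges)
print(L[1])                              # [-1, 3, -1, 0, -1, 0]
print(edge_cut(L, [0, 0, 1, 0, 0, 1]))   # compact halves: 2
print(edge_cut(L, [0, 1, 0, 1, 0, 1]))   # checkerboard: 7
```

The compact partitioning cuts two edges while the checkerboard cuts all seven,
which is why compact subgraphs are the goal.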

The Recursive Spectral Bisection method partitions the graph by computing
the eigenvector corresponding to the second smallest eigenvalue of the Laplacian
matrix. This method is considered to give globally good partitioning results,
but it is computationally very demanding and the partitioning can be suboptimal
in the fine details. The quality of the partitioning can often be improved with
a local refinement algorithm. A local refinement method is also one building
block in a multi-level method. Multi-level methods decrease the complexity of
the partitioning considerably and often give better results than the
corresponding non-hierarchical methods. The vertex weights usually correspond to
the arithmetic work and can be very different for the different vertices. The
load balance can then be considerably improved by ignoring the neighbor
relations and by using some bin-packing method for the partitioning, e.g.
variants of the Greedy partitioning method. These algorithms are also very fast.
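
A minimal sketch of a greedy, bin-packing style partitioner is given below in
Python. It is one common variant (heaviest vertex first, always into the
currently lightest partition) and is not claimed to be the variant implemented
in the package. The load imbalance ratio is taken here to be the maximum load
divided by the average load, which is an assumption about the definition.

```python
# Illustrative sketch of a greedy (bin-packing) partitioner; the neighbor
# relations are ignored on purpose, only the vertex weights matter.
import heapq

def greedy_partition(weights, nparts):
    """Assign the heaviest unassigned vertex to the lightest partition."""
    heap = [(0, p) for p in range(nparts)]   # entries are (load, partition id)
    heapq.heapify(heap)
    part = [None] * len(weights)
    for v in sorted(range(len(weights)), key=lambda v: -weights[v]):
        load, p = heapq.heappop(heap)
        part[v] = p
        heapq.heappush(heap, (load + weights[v], p))
    loads = [0] * nparts
    for v, p in enumerate(part):
        loads[p] += weights[v]
    return part, loads

weights = [7, 5, 4, 3, 3, 2]
part, loads = greedy_partition(weights, 2)
imbalance = max(loads) / (sum(loads) / len(loads))
print(part)       # [0, 1, 1, 0, 1, 0]
print(loads)      # [12, 12]
print(imbalance)  # 1.0
```

The cost is dominated by the sort, which is why such methods are fast, but
vertices in the same partition need not be neighbors, so the edge-cut can be
large.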

The three methods above are included in our graph-module. These operations
are used to map a connected set of data blocks, originating from the structured
applications below, to the processors. They can also be used directly in an
application if a Laplacian matrix is provided.

3.2 Irregular Workload Problems

We consider the problem of partitioning a rectangular array with an irregular 
workload. This problem arises for example in ocean modeling where the sea depth 
is inhomogeneous. For structured grids it is necessary to have structured
partitions, i.e. blocks, to keep the efficiency in the solver. Also, this is a natural 
parallelization strategy. The original solver can then be reused on the blocks. To 
get a good load balance, the blocks can have different sizes or several blocks can 
be assigned to the same processors. The load balance is essential but a number 
of other, sometimes conflicting, objectives should also be met. We must consider 
the following issues in the partitioning: 

— The load should be as even as possible. 

— The number of communication points should be as low as possible. 

— There should be as few blocks as possible. 

— The blocks should be as dense as possible. 

The last item states that the blocks should contain as little area as possible where 
no computations are required. For example, we should minimize the “empty” 
points corresponding to land-points in ocean modeling. 

Recursive Coordinate Bisection is a fast and robust method to partition a 
rectangular domain into equally weighted blocks, but does not consider the last 
item above. The whole domain is partitioned into equally weighted blocks and 
consequently all blocks contain work. If the domain is partitioned in a number of 
blocks without regarding the workload some blocks may be completely “empty” 
and can be removed, reducing the number of unused points. The load balance 
will then depend on the number of blocks and their sizes. Many small blocks give 
a good load balance but also an increased overhead in moving data between the 
blocks. A strategy, used for example in partitioning the Baltic Sea, is to (i)
divide the domain uniformly in a number of blocks, (ii) remove the empty blocks, 
(iii) split only the heavy blocks to smooth the workload between the blocks, (iv) 
map the resulting blocks to the processors using a graph partitioning method, 
(v) merge blocks within the partitions to larger rectangular blocks, and (vi) 
shrink the boundaries of the blocks if possible to reduce further the number of 
unused points. This approach includes the three types of operations, splitting, 
mapping, and assembly, which are supported in the partitioning software. 
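
Steps (i), (ii), and (vi) of this strategy can be sketched as follows. The
Python code is illustrative only: the function names are hypothetical, a weight
of zero stands for a land point, and the "active fraction" is assumed to mean
the share of points inside the blocks that carry work.

```python
# Illustrative sketch of steps (i), (ii) and (vi); names are assumptions.
# A block is a half-open index box (r0, r1, c0, c1); weight 0 means "land".

def uniform_blocks(nrows, ncols, pr, pc):
    """Step (i): divide the domain uniformly into pr x pc blocks."""
    rb = [i * nrows // pr for i in range(pr + 1)]
    cb = [j * ncols // pc for j in range(pc + 1)]
    return [(rb[i], rb[i + 1], cb[j], cb[j + 1])
            for i in range(pr) for j in range(pc)]

def weight(w, block):
    r0, r1, c0, c1 = block
    return sum(w[r][c] for r in range(r0, r1) for c in range(c0, c1))

def shrink(w, block):
    """Step (vi): move the boundaries of a non-empty block inwards past
    rows and columns that contain no work."""
    r0, r1, c0, c1 = block
    rows = [r for r in range(r0, r1) if any(w[r][c] for c in range(c0, c1))]
    cols = [c for c in range(c0, c1) if any(w[r][c] for r in range(r0, r1))]
    return (rows[0], rows[-1] + 1, cols[0], cols[-1] + 1)

def active_fraction(w, blocks):
    """Share of the points covered by the blocks that carry work."""
    area = sum((r1 - r0) * (c1 - c0) for r0, r1, c0, c1 in blocks)
    active = sum(1 for r0, r1, c0, c1 in blocks
                 for r in range(r0, r1) for c in range(c0, c1) if w[r][c] > 0)
    return active / area

w = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 0, 0, 0, 0, 3],
]
blocks = uniform_blocks(4, 6, 2, 2)                  # step (i)
kept = [b for b in blocks if weight(w, b) > 0]       # step (ii)
shrunk = [shrink(w, b) for b in kept]                # step (vi)
print(kept)    # [(0, 2, 0, 3), (2, 4, 0, 3), (2, 4, 3, 6)]
print(shrunk)  # [(1, 2, 1, 3), (2, 3, 1, 3), (3, 4, 5, 6)]
print(active_fraction(w, blocks), active_fraction(w, shrunk))
```

In this toy domain the active fraction rises from 5/24 for the uniform blocks
to 1.0 after the empty block is removed and the remaining blocks are shrunk.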

3.3 Composite Grids 

The partitioning of composite grids is complicated by the irregular data
dependencies between the subgrids and by the different subgrid sizes. The number of
subgrids and their sizes can differ much in various applications. The partitioning 
strategy then becomes very dependent on the specific application. 

Composite grids exhibit two levels of parallelism, between the grids and
within the grids. Exploiting either the coarse-grain parallelism, i.e. the
subgrid level, or the fine-grain parallelism, i.e. the grid point level within
the subgrids, is straightforward and is commonly used in computational fluid
dynamics (CFD) applications with multiblock grids. A further development of
these two strategies is to divide the processors into subsets, one per subgrid,
and to partition the grids within the different subsets of processors. The
success of these methods depends very much on the number of subgrids and their
sizes as well as the number of available processors.

We have developed a new approach where we exploit both levels of the
inherent parallelism in an efficient way. We consider all couplings between the 
subgrids and at the same time balance the arithmetic workload very well. The 
partitioning algorithm can be composed by low-level operations. The basic idea 
to partition the complete composite grid on all the processors is to divide each 
element grid into a number of smaller data blocks, set up a connectivity graph 
for the blocks and apply a graph partitioning method. Then, merge small blocks 
into larger rectangular blocks within each processor and subgrid. To get a general 
and efficient algorithm, the partitioning can be done in a multilevel fashion. The 
subgrids can be split recursively until the number of blocks is larger than the 
number of processors. Then the blocks can be mapped to the processors using 
a graph partitioning method. To get an even better load balance the blocks can 
be split further, projecting the partitioning, and refining with a local refinement 
method. Finally, the blocks can be merged within the processors. 

The low-level operations are supported in the software package and the out- 
lined algorithm can easily be implemented by composition of the different ope- 
rations from the respective classes. In addition, the previous methods described 
above are available. This gives the user the freedom to choose the best possible
algorithm for her application from a set of methods.



4 Applications 

With the software tools we can easily compose a partitioning algorithm that 
adapts to a specific application. As two realistic examples we have partitioned 
an oceanographic model for the Baltic sea and a multiblock grid used in com- 
putational fluid dynamics. 

The Baltic Sea application contains an irregular workload due to varying
sea depth and a large fraction of land points in the computational grid, see
Fig. 2. We have used the strategy described in Sect. 3.2 to partition the Baltic
Sea. Compared to some standard partitioning methods our block decomposition
gives very promising results, see Table 1. The unstructured partitioning with
the Recursive Spectral Bisection method is not applicable unless the solver is
substantially recoded. The Recursive Coordinate Bisection method has
a high fraction of land-filled grid points, more than 90%, and a large number of
edge-cuts. The straightforward uniform partitioning has too much load imbalance
to be efficient. Our new block method gives a good compromise between these
different requirements.

The Swedish Meteorological and Hydrological Institute is currently in the
process of parallelizing their operational model for the Baltic Sea and will be
using the described partitioning approach and our software tools. No actual
runtimes are yet available for this problem.




Fig. 2. The Baltic Sea (left) with block-structured partitioning (right). We have irre- 
gular workload due to varying sea depth and land-filled grid points 



Table 1. Our method, Block, is compared with three standard methods: Uniform block
distribution (same as the (BLOCK, BLOCK) distribution in HPF), Recursive Coordinate
Bisection (RCB), and Recursive Spectral Bisection (RSB). Few blocks, a high fraction
of active points, a low load imbalance ratio, and a small number of edge-cuts are
preferable. However, these requirements are conflicting.



Method    Blocks   Active   Load     Edge-cut
Uniform   6        0.082    1.78     287
RCB       6        0.082    1.028    287
RSB       2171     1.00     1.0015   112
Block     32       0.42     1.042    130



The other application is a multiblock grid. We have five blocks modeling an
expanding and contracting tube, see Fig. 3. We have solved the compressible
Navier-Stokes equations for airflow in the tube. The code is parallelized with
the Cogito software tools and we have partitioned the grids with our new
partitioning approach described in Sect. 3.3. The numerical experiments^, see

^ The numerical experiments were performed at the Edinburgh Parallel Computing 
Centre supported by the Training and Research on Advanced Computing Systems 
(TRACS) programme.

Fig. 4, show that our software tools produce good partitionings. We
have almost 50% efficiency on the largest processor configuration, even though
this grid is quite small to run on all 512 processors.



[Figure 3 panels: (a) Data partitioning; (b) Stream lines]

Fig. 3. Simulation of airflow in an expanding and contracting tube. We have partitioned 
the multiblock grid by decomposing the individual subgrids in smaller data blocks, 
setting up a graph with the blocks as vertices, and partitioning the graph with the 
Recursive Spectral Bisection method 



Our software also supports implementing the other strategies for partitioning
composite grids, but it will be difficult to compare them for the
expanding and contracting tube case. A real-life application typically contains
about ten times more subgrids than our example, e.g. the SAAB-2000 aircraft is
modeled with 48 subgrids of different sizes and shapes. Therefore we have made
the comparisons for a simpler application, the advection equations in 2D, but
modeled the geometry with 20 subgrids. The results from the experiments are

Fig. 4. Speedup of the Navier-Stokes solver. We have five subgrids with a total of
450,000 grid points and we use up to 512 processors on a Cray T3D. The grids are
partitioned with the strategy described in Fig. 3.



shown in Fig. 5. We can see that our method gives the best results except for
large processor configurations, where the clustering strategy is preferred. A more
extensive study and comparison of the different partitioning strategies may be
found elsewhere.

The implementation of solvers that can handle the kind of partitions shown
in Fig. 3 is non-trivial. The communication will have an irregular pattern and
special care must be taken to get an efficient implementation. The Cogito
software tools are specially designed for this kind of application and use MPI
with persistent communication objects to handle the update of ghost cells. The
HPF-2 specification includes support for implementing the different partitioning
strategies, but compilers that include all the new features are not yet
available. The KeLP infrastructure can handle the irregular communication
pattern for a single grid very efficiently. A cooperation has now been initiated
to develop abstractions in KeLP that support composite grids as well.



5 Conclusions 

We have constructed a set of software tools in Fortran 90 with an object-oriented 
design. The object-oriented design yields a set of low-level operations that can be 
combined in any order to get a flexible partitioning algorithm. As an additional 
result, the partitioning framework also serves as a testbed to construct, evaluate, 
and visualize different partitioning algorithms. 

The software tools address both unstructured graph partitioning and
structured block partitioning. The emphasis is on the latter category. The applications

Fig. 5. Speedup of the advection equation solver. We have 20 subgrids with a total
of 90,000 grid points and we use up to 128 processors on a Cray T3D. Four different
partitioning strategies are compared: (1) graph: our new partitioning method, (2) single:
partitioning the grids one at a time over all processors, (3) block: distributing the
subgrids as they are to different processors, and (4) cluster: clustering the processors
for the subgrids and partitioning the subgrids within the different processor clusters.



are for example ocean modeling and multiblock grids within CFD. The parti- 
tioning software provides a new way to compose a partitioning algorithm with 
interaction from the user. Traditional software packages either try to give a
universal solution to the partitioning problem, which may not be optimal for
every application, or are limited to a specific problem. Our package is general
enough to be efficient for a large variety of partitioning problems. We have shown 
this by partitioning data for an oceanographic model and a multiblock grid used 
in airflow computations. For both of these applications we have been able to 
construct new partitioning algorithms with our software that in some cases are 
better than the previous standard algorithms.

Abstract. The multiple minimum degree (MMD) algorithm and its variants
have enjoyed more than 20 years of research and progress in generating
fill-reducing orderings for sparse, symmetric, positive definite matrices.
Although conceptually simple, efficient implementations of these
algorithms are deceptively complex and highly specialized. 

In this case study, we present an object-oriented library that implements 
several recent minimum degree-like algorithms. We discuss how object- 
oriented design forces us to decompose these algorithms in a different 
manner than earlier codes and demonstrate how this impacts the flexibility
and efficiency of our C++ implementation. We compare the performance
of our code against other implementations in C or Fortran.



1 Introduction 

We have implemented a family of algorithms in scientific computing, traditionally 
written in Fortran 77 or C, using object-oriented techniques and C++. The
particular family of algorithms chosen, the Multiple Minimum Degree (MMD) 
algorithm and its variants, is a fertile area of research and has been so for the last 
twenty years. Several significant advances have been published as recently as the 
last three years. Current implementations, unfortunately, tend to be specific to a 
single algorithm, are highly optimized, and are generally not readily extensible. 
Many are also not in the public domain. 

Our goal was to construct an object-oriented library that provides a labora- 
tory for creating and experimenting with these newer algorithms. In anticipation 
of new variations that are likely to be proposed in the future, we wanted the 
code to be extensible. The performance of the code must also be competitive 
with other implementations. 

These algorithms generate permutations of large, sparse, symmetric matrices
to control the work and storage required to factor the matrix. We explain
how the work and storage for factorization of a matrix depend on the
ordering in Sect. 2. This is formally stated as the fill-minimization problem. Also
in Sect. 2 we review the Minimum Degree algorithm and its variants, emphasizing
* This work was supported by National Science Foundation grants CCR-9412698 and 
DMS-9807172, by a GAANN fellowship from the Department of Education, and by 
NASA under Contract NAS1-19480.

recent developments. In Sect. 3 we discuss the design of our library, fleshing out
the primary objects and how they interact. We present our experimental results
in Sect. 4, examining the quality of the orderings obtained with our codes and
comparing the speed of our library with other implementations. The exercise
has led us to new insights into the nature of these algorithms. We provide some
interpretation of the experience in Sect. 5.



2 Background 

We illustrate the effect ordering has on the work and storage requirements of 
matrix factorization, translate this to a useful graph theoretic tool, explain the 
rationale in which heuristic algorithms attempt to control work and storage, 
and mention a specialized data structure common to all competitive Minimum 
Degree-like algorithms called the quotient graph. 



2.1 Sparse Matrix Factorization 

Consider a linear system of equations $Ax = b$, where the coefficient matrix $A$ is
sparse, symmetric, and either positive definite or indefinite. A direct method for
solving this problem computes a factorization of the matrix $A = LBL^T$, where
$L$ is a lower triangular matrix and $B$ is a block diagonal matrix with 1 x 1 or
2 x 2 blocks.

The factor $L$ is computed by setting $L_0 = A$ and then creating $L_{k+1}$ by
adding multiples of rows and columns of $L_k$ to other rows and columns of $L_k$.
This implies that $L$ has nonzeros in all the same positions^ as $A$ plus some
nonzeros in positions that were zero in $A$, but induced by the factorization. It is
exactly these nonzeros that are called fill elements. The presence of fill increases
both the storage and work required in the factorization.

An example matrix is provided in Fig. 1 that shows nonzeros in original
positions of A as “x” and marks fill elements with a different symbol. This
example incurs two fill elements. The order in which the factorization takes
place greatly influences the amount of fill. The matrix A is often permuted by
rows and columns to reduce the number of fill elements, thereby reducing the
storage and flops required for factorization. Given the example in Fig. 1, the
elimination order {2, 6, 1, 3, 4, 5} produces only one fill element. This is the
minimum number of fill elements for this example.

If A is positive definite, Cholesky factorization is numerically stable for any 
symmetric permutation of A, and the fill-reducing permutation need not be 
modified during factorization. If A is indefinite, then the initial permutation 
may have to be further modified during factorization for numerical stability. 

¹ No “accidental” cancellations will occur during factorization if the numerical values
in A are algebraic indeterminates.



Fig. 1. Examples of factorization and fill. For each factorization step, k, there is the
nonzero structure of the factor, L_k, the associated elimination graph, G_k, and the
quotient graph Q_k. The elimination graph consists of vertices and edges. The quotient
graph has edges and two kinds of vertices, supernodes (ovals) and enodes (boxed ovals).



2.2 Elimination Graph 

The graph G of the sparse matrix A is a graph whose vertices correspond to the
columns of A. We label the vertices 1, ..., n, to correspond to the n columns
of A. An edge (i, j) connecting vertices i and j in G exists if and only if a_ij is
nonzero. By symmetry, a_ji is also nonzero.

The graph model of symmetric Gaussian elimination was introduced by
Parter. A sequence of elimination graphs, G_k, represents the fill created in each
step of the factorization. The initial elimination graph is the graph of the matrix,
G_0 = G(A). At each step k, let v_k be the vertex corresponding to the column
of A to be eliminated. The elimination graph at the next step, G_{k+1}, is obtained
by adding edges to make all the vertices adjacent to v_k pairwise adjacent to each
other, and then removing v_k and all edges incident on v_k. The inserted edges are
fill edges in the elimination graph. This process repeats until all the vertices are
removed from the elimination graph. The example in Fig. 1 illustrates the graph
model of elimination. Finding an elimination order that produces the minimum
amount of fill is NP-complete [2].
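The elimination step described above can be sketched in a few lines of C++. This is only an illustration of the graph model, not the library's code: the names Graph, makeGraph, eliminate, and countFill are hypothetical, vertices are numbered from 0 rather than 1, and the graph is stored explicitly as adjacency sets.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Explicit elimination graph: one adjacency set per vertex (0-based ids).
using Graph = std::vector<std::set<int>>;

// Build an undirected graph on n vertices from an edge list.
Graph makeGraph(int n, const std::vector<std::pair<int, int>>& edges) {
    Graph g(n);
    for (const auto& e : edges) {
        g[e.first].insert(e.second);
        g[e.second].insert(e.first);
    }
    return g;
}

// Eliminate vertex v: make its neighbors pairwise adjacent, then remove v
// and its incident edges. Returns the number of fill edges inserted.
int eliminate(Graph& g, int v) {
    const std::vector<int> nbrs(g[v].begin(), g[v].end());
    int fill = 0;
    for (std::size_t a = 0; a < nbrs.size(); ++a) {
        for (std::size_t b = a + 1; b < nbrs.size(); ++b) {
            if (g[nbrs[a]].insert(nbrs[b]).second) {  // edge was absent: fill
                g[nbrs[b]].insert(nbrs[a]);
                ++fill;
            }
        }
    }
    for (int u : nbrs) g[u].erase(v);
    g[v].clear();
    return fill;
}

// Total fill produced by eliminating the vertices in the given order.
int countFill(Graph g, const std::vector<int>& order) {
    int total = 0;
    for (int v : order) total += eliminate(g, v);
    return total;
}
```

On a star graph, for instance, eliminating the center first makes all leaves pairwise adjacent, whereas eliminating the leaves first creates no fill at all, which is the effect of ordering that the text describes.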
Table 1. Algorithms that fit into the Minimum Priority family

Abbreviation   Algorithm Name                                  Primary Reference
MMD            Multiple Min. Degree                            Liu
AMD            Approximate Min. Degree                         Amestoy, Davis and Duff
AMF            Approximate Min. Fill                           Rothberg
AMMF           Approximate Min. Mean Local Fill                Rothberg and Eisenstat
AMIND          Approximate Min. Increase in Neighbor Degree    Rothberg and Eisenstat
MMDF           Modified Min. Deficiency                        Ng and Raghavan [7]
MMMD           Modified Multiple Min. Degree                   Ng and Raghavan [7]


2.3 Ordering Heuristics 





An upper bound on the fill that a vertex of degree d can create on elimination is
d(d - 1)/2. The minimum degree algorithm attempts to minimize fill by choosing
the vertex with the minimum degree in the current elimination graph, hence
reducing fill by controlling this worst-case bound. In Multiple Minimum Degree
(MMD), a maximal independent set of vertices of low degree is eliminated in
one step to keep the cost of updating the graph low.
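As a minimal sketch of the basic heuristic, the following C++ function computes a single-elimination minimum degree ordering on an explicit elimination graph. It is an assumption-laden illustration: the names Graph and minimumDegreeOrder are hypothetical, ties are broken by the lowest vertex index (an arbitrary choice), and none of the enhancements discussed in this paper (multiple elimination, quotient graphs, approximate degrees) are included.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Explicit elimination graph: one adjacency set per vertex (0-based ids).
using Graph = std::vector<std::set<int>>;

// Greedy minimum degree ordering. At every step the uneliminated vertex of
// smallest current degree is chosen; ties go to the lowest vertex id.
std::vector<int> minimumDegreeOrder(Graph g) {
    const int n = static_cast<int>(g.size());
    std::vector<bool> eliminated(n, false);
    std::vector<int> order;
    order.reserve(n);
    for (int step = 0; step < n; ++step) {
        int best = -1;
        for (int v = 0; v < n; ++v) {
            if (eliminated[v]) continue;
            if (best < 0 || g[v].size() < g[best].size()) best = v;
        }
        // Eliminate the chosen vertex: connect its neighbors pairwise,
        // then detach it from the graph.
        const std::vector<int> nbrs(g[best].begin(), g[best].end());
        for (std::size_t a = 0; a < nbrs.size(); ++a) {
            for (std::size_t b = a + 1; b < nbrs.size(); ++b) {
                g[nbrs[a]].insert(nbrs[b]);
                g[nbrs[b]].insert(nbrs[a]);
            }
        }
        for (int u : nbrs) g[u].erase(best);
        g[best].clear();
        eliminated[best] = true;
        order.push_back(best);
    }
    return order;
}
```

The linear scan for the minimum makes this quadratic in the number of vertices; practical codes replace it with a bucket structure keyed by degree.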

Many more enhancements are necessary to obtain a practically efficient
implementation of MMD. A survey article by George and Liu [8] provides the
details. There have been several contributions to the field since the survey. The
algorithms that we implement in our library, with references, are listed in Table 1.
Most of these adaptations increase the runtime by 5-25% but reduce the amount
of arithmetic required to generate the factor by 10-25%.



2.4 The Quotient Graph 

Up to this point we have been discussing the elimination graph to model fill 
in a minimum priority ordering. While it is an important conceptual tool, it 
has difficulties in implementation arising from the fact that the storage required 
can grow like the size of the factor and cannot be predetermined. In practice, 
implementations use a quotient graph, Q, to represent the elimination graph in 
no more space than that of the initial graph G(A). A quotient graph can have 
the same interface as an elimination graph, but it must handle internal data 
differently, essentially through an extra level of indirection. 

The quotient graph has two distinct kinds of vertices: supernodes and enodes.²
A supernode represents a set of one or more uneliminated columns of
A. Similarly, an enode represents a set of one or more eliminated columns of
A. The initial graph, Q_0, consists entirely of supernodes and no enodes; further,
each supernode contains one column. Edges are constructed the same as in
the elimination graph. The initial quotient graph, Q_0, is identical to the initial
elimination graph, G_0.

² Also called “eliminated supernode” or “element” elsewhere.
When a supernode is eliminated at some step, it is not removed from the
quotient graph; instead, the supernode becomes an enode. Enodes indirectly
represent the fill edges in the elimination graph. To demonstrate how, we first
define a reachable path in the quotient graph as a path (i, e_1, e_2, ..., e_p, j), where
i and j are supernodes in Q_k and e_1, e_2, ..., e_p are enodes. Note that the number
of enodes in the path can be zero. We also say that a pair of supernodes i, j
is reachable in Q_k if there exists a reachable path joining i and j. Since the
number of enodes in the path can be zero, adjacency in Q_k implies reachability
in Q_k. If two supernodes i, j are reachable in the quotient graph Q_k, then the
corresponding two vertices i, j in the elimination graph G_k are adjacent in G_k.

In practice, the quotient graph is aggressively optimized; all non-essential 
enodes, supernodes, and edges are deleted. Since we are only interested in paths 
through enodes, if two enodes are adjacent they are amalgamated into one. So 
in practice, the number of enodes in all reachable paths is limited to either 
zero or one. Alternatively, one can state that, in practice, the reachable set of a 
supernode is the union of its adjacent supernodes and all supernodes adjacent 
to its adjacent enodes. This amalgamation process is one way in which some enodes
come to represent more than their original eliminated column.
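The practical definition of the reachable set given above (adjacent supernodes, plus supernodes adjacent to adjacent enodes) can be sketched as follows. The storage layout and the names QuotientGraphSketch, reachableSet, and makeExample are assumptions made for illustration; the paper does not specify the library's internal representation.

```cpp
#include <set>
#include <vector>

// Hypothetical quotient graph storage: every node id indexes two lists,
// its adjacent supernodes and its adjacent enodes.
struct QuotientGraphSketch {
    std::vector<std::vector<int>> adjSupernodes;
    std::vector<std::vector<int>> adjEnodes;
};

// Reachable set of supernode s, assuming enodes have already been
// amalgamated so that every reachable path contains at most one enode.
std::set<int> reachableSet(const QuotientGraphSketch& q, int s) {
    std::set<int> reach(q.adjSupernodes[s].begin(), q.adjSupernodes[s].end());
    for (int e : q.adjEnodes[s]) {
        for (int t : q.adjSupernodes[e]) {
            if (t != s) reach.insert(t);  // s itself is excluded here
        }
    }
    return reach;
}

// Small made-up example: node 0 is an enode adjacent to supernodes 1 and 2;
// supernodes 1 and 3 share an ordinary edge.
QuotientGraphSketch makeExample() {
    QuotientGraphSketch q;
    q.adjSupernodes = {{1, 2}, {3}, {}, {1}};
    q.adjEnodes = {{}, {0}, {0}, {}};
    return q;
}
```

In the example, supernode 1 reaches supernode 3 directly and supernode 2 through enode 0, so no explicit fill edge between 1 and 2 needs to be stored.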

Supernodes are also amalgamated but with a different rationale. Two su- 
pernodes are indistinguishable if their reachable sets (including themselves) are 
identical. When this occurs, all but one of the indistinguishable supernodes can 
be removed from the graph. The remaining supernode keeps a list of all the 
columns of the supernodes compressed into it. When the remaining supernode 
is eliminated and becomes an enode, all its columns can be eliminated together. 
The search for indistinguishable supernodes can be done before eliminating a 
single supernode using graph compression. More supernodes become
indistinguishable as elimination proceeds. An exhaustive search for indistinguishable
supernodes during elimination is prohibitively expensive, so it is often limited 
to supernodes with identical adjacency sets (assuming a self-edge) instead of 
identical reachable sets. 
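The cheaper test, identical adjacency sets with an assumed self-edge, can be sketched as a grouping of vertices by their closed adjacency sets. The function name findIndistinguishable is hypothetical, and the ordered map is chosen for brevity; this is an illustration of the test itself, not of how any of the codes discussed here implement it.

```cpp
#include <map>
#include <set>
#include <vector>

// Groups vertices whose closed adjacency sets (neighbors plus the vertex
// itself, i.e. assuming a self-edge) are identical. Only groups with more
// than one member are returned.
std::vector<std::vector<int>> findIndistinguishable(
        const std::vector<std::set<int>>& adj) {
    std::map<std::set<int>, std::vector<int>> groups;
    for (int v = 0; v < static_cast<int>(adj.size()); ++v) {
        std::set<int> closed = adj[v];
        closed.insert(v);  // the assumed self-edge
        groups[closed].push_back(v);
    }
    std::vector<std::vector<int>> result;
    for (const auto& g : groups) {
        if (g.second.size() > 1) result.push_back(g.second);
    }
    return result;
}
```

For a triangle on vertices 0, 1, 2 with an extra vertex 3 attached to vertex 2, only vertices 0 and 1 have the same closed adjacency set and would be merged.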

Edges between supernodes can be removed as elimination proceeds. When a 
pair of adjacent supernodes share a common enode, they are reachable through 
both the shared edge and the shared enode. In this case, the edge can be safely 
removed. This not only improves storage and speed, but allows tighter approxi- 
mations to supernode degree as well. 

Going once more to Fig. 1, we consider now the quotient graph. Initially,
the elimination graph and quotient graph are identical. After the elimination of
column 1, we see that supernode 1 is now an enode. Note that unlike the
elimination graph, no edge was added between supernodes 3 and 4 since they are
reachable through enode 1. After the elimination of column 2, we have removed
an edge between supernodes 5 and 6. This was done because the edge was
redundant; supernode 5 is reachable from 6 through enode 2. When we eliminate
column 3, supernode 3 becomes an enode and absorbs enode 1 (including its edge
to supernode 4). Now enode 3 is adjacent to supernodes 4, 5 and 6. The fill edge
between supernodes 4 and 5 is redundant and can be removed. At this point 4,
k ← 0
while k < n
    Let m be the minimum known degree, deg(x), of all x ∈ G_k.
    while m is still the minimum known degree of all x ∈ G_k
        Choose supernode x_k such that deg(x_k) = m.
        for all of the p columns represented by supernode x_k:
            Number columns (k + 1) ... (k + p).
        Form enode e_k from supernode x_k and all adjacent enodes.
        for all supernodes x adjacent to e_k:
            Label deg(x) as “unknown.”
        k ← k + p
    for all supernodes x where deg(x) is unknown:
        Update lists of adjacent supernodes and enodes of x.
        Check for various QuotientGraph optimizations.
        Compute deg(x).



Fig. 2. The multiple minimum degree algorithm defined in terms of a quotient graph 



5, and 6 are indistinguishable. However, since we cannot afford an exhaustive
search, a quick search (by looking for identical adjacency lists) finds only
supernodes 5 and 6, so they are merged into supernode {5,6}. Then supernode 4
becomes an enode and absorbs enode 3. Finally supernode {5,6} is eliminated.
The relative order between columns 5 and 6 has no effect on fill.

We show the Multiple Minimum Degree algorithm defined in terms of a
quotient graph in Fig. 2. A single elimination Minimum Degree algorithm is similar,
but executes the inner while loop only once. We point out that we have not
provided an exhaustive accounting of quotient graph features and optimizations.
Most of the time is spent in the last three lines of Fig. 2, and often they are
tightly intertwined in implementations.



3 Design 



To provide a basis for comparison, we briefly discuss the design and
implementation characteristics of GENMMD, Liu's implementation of MMD, and of
the AMD code of Amestoy, Davis and Duff. Both implementations were
written in Fortran 77 using a procedural decomposition. They have no dynamic
memory allocation and implement no abstract data types in the code besides
arrays.

GENMMD is implemented in roughly 500 lines of executable source code 
with about 100 lines of comments. The main routine has 12 parameters in its 
calling sequence and uses four subroutines that roughly correspond to initia- 
lization, supernode elimination, quotient graph update/degree calculation, and 
finalization of the permutation vector. The code operates in a very tight foot- 
print and will often use the same array for different data structures at the same 
time. The code has over 20 goto statements and can be difficult to follow. 
// Major Classes
QuotientGraph* qgraph;
BucketSorter* sorter;
PriorityStrategy* priority;
SuperNodeList* reachableSuperNodes, * mergedSuperNodes;

// Initialization...
// Load all vertices into sorter
 1. priority->computeAndInsert( PriorityStrategy::ALL_NODES, qgraph, sorter );
 2. if ( priority->requireSingleElimination() == true )
 3.     maxStep = 1;
    else
 4.     maxStep = qgraph->size();

// Main loop
 5. while ( sorter->notEmpty() ) {
 6.     int min = sorter->queryMinNonemptyBucket();
 7.     int step = 0;
 8.     while ( ( min == sorter->queryMinNonemptyBucket() ) &&
                ( step < maxStep ) ) {
 9.         int snode = sorter->removeItemFromBucket( min );
10.         qgraph->eliminateSupernode( snode );
            SuperNodeList* tempSuperNodes;
11.         tempSuperNodes = qgraph->queryReachableSet( snode );
12.         sorter->removeSuperNodes( tempSuperNodes );
13.         *reachableSuperNodes += *tempSuperNodes;
14.         ++step;
        }
15.     qgraph->update( reachableSuperNodes, mergedSuperNodes );
16.     sorter->removeSuperNodes( mergedSuperNodes );
17.     priority->computeAndInsert( reachableSuperNodes,
                                    qgraph, sorter );
18.     mergedSuperNodes->resize( 0 );
19.     reachableSuperNodes->resize( 0 );
    }



Fig. 3. A general Minimum Priority Algorithm using the objects described in Fig. 4
- Quotient Graph
  1. Must provide a method for extracting the Reachable Set of a vertex.
  2. Be able to eliminate supernodes on demand.
  3. Should have a separate lazy update method for multiple elimination.
  4. Should provide lists of compressed vertices that can be ignored for the rest of
     the ordering algorithm.
  5. Must produce an elimination tree or permutation vector after all the vertices
     have been eliminated.
  6. Should allow const access to current graph for various Priority Strategies.

- Bucket Sorter
  1. Must remove an item from the smallest non-empty bucket in constant time.
  2. Must insert an item-key pair in constant time.
  3. Must remove an item by name from anywhere in constant time.

- Priority Strategy
  1. Must compute the new priority for each vertex in the list.
  2. Must insert the priority-vertex pairs into the Bucket Sorter.



Fig. 4. Three most important classes in a minimum priority ordering and some of their 
related requirements. 
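One way the Bucket Sorter requirements could be met is with array-based doubly linked lists, one list per key. The sketch below is an assumption, not the library's class: the name BucketSorterSketch and the members insert and remove are hypothetical, while notEmpty, queryMinNonemptyBucket, and removeItemFromBucket borrow their names from Fig. 3. Insertion and removal by name are constant time; finding the smallest non-empty bucket scans upward from a cached lower bound, so it is cheap in practice rather than strictly constant.

```cpp
#include <vector>

class BucketSorterSketch {
public:
    BucketSorterSketch(int numBuckets, int numItems)
        : head_(numBuckets, -1), next_(numItems, -1), prev_(numItems, -1),
          key_(numItems, -1), count_(0), minHint_(numBuckets) {}

    bool notEmpty() const { return count_ > 0; }

    // Insert an item-key pair at the front of its bucket: O(1).
    void insert(int item, int key) {
        key_[item] = key;
        prev_[item] = -1;
        next_[item] = head_[key];
        if (head_[key] != -1) prev_[head_[key]] = item;
        head_[key] = item;
        if (key < minHint_) minHint_ = key;
        ++count_;
    }

    // Remove an item by name from wherever it is stored: O(1).
    void remove(int item) {
        const int key = key_[item];
        if (key < 0) return;  // item is not currently stored
        if (prev_[item] != -1) next_[prev_[item]] = next_[item];
        else head_[key] = next_[item];
        if (next_[item] != -1) prev_[next_[item]] = prev_[item];
        key_[item] = -1;
        --count_;
    }

    // Smallest non-empty bucket, or -1 if the sorter is empty.
    int queryMinNonemptyBucket() {
        if (count_ == 0) return -1;
        while (head_[minHint_] == -1) ++minHint_;
        return minHint_;
    }

    // Remove and return the first item of a bucket (-1 if it is empty).
    int removeItemFromBucket(int bucket) {
        const int item = head_[bucket];
        if (item != -1) remove(item);
        return item;
    }

private:
    std::vector<int> head_;  // first item of each bucket, -1 if empty
    std::vector<int> next_;  // next item in the same bucket
    std::vector<int> prev_;  // previous item in the same bucket
    std::vector<int> key_;   // current key of each item, -1 if absent
    int count_;              // number of stored items
    int minHint_;            // lower bound on the smallest occupied key
};
```

Because items are identified by small integers, the list links live in flat arrays and no memory is allocated after construction.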



AMD has roughly 600 lines of executable source code which almost doubles 
when the extensive comments are included. It is implemented as a single routine 
with 16 calling parameters and no subroutine calls. It is generally well structured 
and documented. Manually touching up our f2c conversion, we were able to 
easily replace the 17 goto statements with while loops, and break and continue 
statements. This code is part of the commercial Harwell Subroutine Library, 
though we report results from an earlier version shared with us. 

The three major classes in our implementation are shown in a basic outline
in Fig. 4. Given these classes, we can describe our fourth object: the
MinimumPriorityOrdering class, which is responsible for directing the interactions of
these other objects. The main method of this class (excluding details, debugging
statements, tests, comments, etc.) is approximately the code fragment in Fig. 3. By
far the most complicated (and expensive) part of the code is line 15 of Fig. 3,
where the graph update occurs.

The most elegant feature of this implementation is that the PriorityStrategy
object is an abstract base class. We have implemented several derived classes,
each one implementing one of the algorithms in Table 1. Each derived class
involves overriding two virtual functions (one of them trivial). The classes derived
from PriorityStrategy average 50 lines of code each. This is an instance of the
Strategy Pattern.
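The shape of such a strategy hierarchy can be sketched as follows. Everything here is hypothetical and simplified: the class names, the priority signature, and the use of an explicit elimination graph instead of a quotient graph are assumptions for illustration, and the two priorities shown (exact degree and exact local fill) are stand-ins rather than the algorithms of Table 1.

```cpp
#include <iterator>
#include <set>
#include <vector>

// Explicit elimination graph: one adjacency set per vertex (0-based ids).
using Graph = std::vector<std::set<int>>;

// Abstract strategy: derived classes override two virtual functions.
class PriorityStrategySketch {
public:
    virtual ~PriorityStrategySketch() = default;
    virtual int priority(const Graph& g, int v) const = 0;
    virtual bool requireSingleElimination() const = 0;
};

// Priority = current degree of the vertex.
class DegreePriority : public PriorityStrategySketch {
public:
    int priority(const Graph& g, int v) const override {
        return static_cast<int>(g[v].size());
    }
    bool requireSingleElimination() const override { return false; }
};

// Priority = number of neighbor pairs that are not yet adjacent, i.e. the
// fill that eliminating v would create.
class DeficiencyPriority : public PriorityStrategySketch {
public:
    int priority(const Graph& g, int v) const override {
        int missing = 0;
        for (auto a = g[v].begin(); a != g[v].end(); ++a)
            for (auto b = std::next(a); b != g[v].end(); ++b)
                if (g[*a].count(*b) == 0) ++missing;
        return missing;
    }
    // Returning true here is an arbitrary choice made for the illustration.
    bool requireSingleElimination() const override { return true; }
};

// The driver depends only on the abstract interface.
int selectVertex(const Graph& g, const PriorityStrategySketch& strategy) {
    int best = -1;
    int bestPriority = 0;
    for (int v = 0; v < static_cast<int>(g.size()); ++v) {
        const int p = strategy.priority(g, v);
        if (best < 0 || p < bestPriority) {
            best = v;
            bestPriority = p;
        }
    }
    return best;
}
```

Swapping the strategy object changes which vertex is selected without touching the driver, which is the property the text attributes to the PriorityStrategy design.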

The trickiest part is providing enough access to the QuotientGraph for the 
PriorityStrategy to be useful and extensible, but to provide enough protection 
to keep the PriorityStrategy from corrupting the rather complicated state infor- 
mation in the QuotientGraph. 

Because we want our library to be extensible, we have to provide the
PriorityStrategy class access to the QuotientGraph. But we want to protect that
access so that the QuotientGraph’s sensitive and complicated internal workings
are abstracted away and cannot be corrupted. We provided a full-fledged iterator
class, called ReachableSetIterator, that encapsulated the details of the
QuotientGraph from the PriorityStrategy, making the interface indistinguishable from an
EliminationGraph.

Unfortunately, the overhead of using these iterators to compute the priorities
was too expensive. We rewrote the PriorityStrategy classes to access the
QuotientGraph at a lower level, traversing adjacency lists instead of reachable
sets. This gave us the performance we needed, but had the unfortunate effect of
increasing the coupling between classes. However, the ReachableSetIterator was
left in the code for ease of prototyping.

Currently we have implemented a PriorityStrategy class for all of the
algorithms listed in Table 1. They all compute their priority as a function of either
the external degree or a tight approximate degree of a supernode. Computing
the external degree is more expensive, but allows multiple elimination. For
technical reasons, to get the approximate degree tight enough the quotient graph
must be updated after every supernode is eliminated, hence all algorithms that
use approximate degree are single elimination algorithms.³ For this reason, all
previous implementations are either multiple elimination codes or single
elimination codes, not both. The quotient graph update is the most complicated part of
the code and single elimination updates are different from multiple elimination
updates.

The MinimumPriorityOrdering class queries the PriorityStrategy whether it 
requires quotient graph updates after each elimination or not. It then relays this 
information to the QuotientGraph class which has different optimized update 
methods for single elimination and multiple elimination. The QuotientGraph 
class can compute partial values for external degree or approximate degree as a 
side-effect of the particular update method. 

Given this framework, it is possible to modify the MinimumPriorityOrdering
class to switch algorithms during elimination. For example, one could use
MMD at first to create a lot of enodes fast, then switch to AMD when the
quotient graph becomes more tightly connected and independent sets of vertices
to eliminate are small. There are other plausible combinations because different
algorithms in Table 1 prefer vertices with different topological properties.
It is possible that the topological properties of the optimal vertex to eliminate
change as elimination progresses.



4 Results 

We compare actual execution times of our implementation to an f2c conversion
of the GENMMD code by Liu. This is currently among the most widely

³ Readers are cautioned that algorithms in Table 1 that approximate quantities other
than degree could be multiple elimination algorithms. Rothberg and Eisenstat
have defined their algorithms using either external degree (multiple elimination) or
approximate degree (single elimination).
Table 2. Comparison of quality of various priority policies. Quality of the ordering
here is measured in terms of the amount of work to factor the matrix with the given
ordering. Refer to Table 1 for algorithm names and references

                      Work                 Work (normalized)
problem               MMD         AMD    AMF   AMMF  AMIND   MMDF   MMMD
 1. commanche         1.76e+06   1.00    .89    .87    .87    .92    .89
 2. barth4            4.12e+06   1.00    .89    .83    .82    .86    .82
 3. barth             4.55e+06   1.02    .90    .84    .85    .91    .89
 4. ford1             1.67e+07    .98    .84    .87    .82    .89    .86
 5. ken13             1.84e+07   1.01    .89    .88    .96    .83    .87
 6. barth5            1.96e+07   1.00    .90    .81    .82    .72    .83
 7. shuttle_eddy      2.76e+07    .97    .87    .74    .74    .75    .81
 8. bcsstk18          1.37e+08    .98    .77    .78    .74    .86    .83
 9. bcsstk16          1.56e+08   1.02    .81    .84    .82    .82    .81
10. bcsstk23          1.56e+08    .95    .79    .73    .75    .80    .81
11. bcsstk15          1.74e+08    .97    .89    .84    .81    .84    .86
12. bcsstk17          2.22e+08   1.10    .89    .85    .88   1.02    .89
13. pwt               2.43e+08   1.03    .92    .87    .90    .88    .90
14. ford2             3.19e+08   1.03    .76    .72    .70    .77    .77
15. bcsstk30          9.12e+08   1.01    .97    .82    .79    .88    .87
16. tandem_vtx        1.04e+09    .97    .77    .56    .66    .70    .77
17. pds10             1.04e+09    .90    .88    .91    .87    .88   1.00
18. copter1           1.33e+09    .96    .82    .62    .71    .79    .87
19. bcsstk31          2.57e+09   1.00    .95    .67    .71    .94    .87
20. nasasrb           5.47e+09    .95    .82    .70    .79    .93    .82
21. skirt             6.04e+09   1.11    .83    .90    .76    .88    .83
22. tandem_dual       8.54e+09    .97    .42    .51    .62    .72    .72
23. onera_dual        9.69e+09   1.03    .70    .48    .57    .65    .71
24. copter2           1.35e+09    .97    .73    .50    .61    .66    .69

geometric mean                   1.00    .84    .74    .77    .83    .83
median                           1.00    .85    .82    .79    .85    .83


used implementations. In general, our object-oriented implementation is within
a factor of 3-4 of GENMMD. We expect this to get closer to a factor of 2-3 as
the code matures. We normalize the execution times of our implementation to
GENMMD and present them in Table 3. For direct comparison, pre-compressing
the graph was disabled in our C++ code. We also show how our code performs
with compression.

All runtimes are from a Sun UltraSPARC-5 with 64MB of main memory.
The software was compiled with GNU C++ version 2.8.1 with the -O and
-fno-exceptions flags set. The list of 24 problems is sorted in nondecreasing
order of the work in computing the factor with the MMD ordering. The numbers
presented are the average of eleven runs with different seeds to the random
number generator. Because these algorithms are extremely sensitive to tie-breaking,
it is common to randomize the graph before computing the ordering.
Table 3. Relative performance of our implementation of MMD (both with and
without precompression) to GENMMD. GENMMD does not have precompression. The
problems are sorted in nondecreasing size of the Cholesky factor

                                              time (seconds)   time (normalized)
                                              GENMMD           C++
problem                 |V|          |E|      no compr.        no compr.   compr.
 1. commanche          7,920       11,880        .08             5.88       5.81
 2. barth4             6,019       17,473        .06             6.17       6.42
 3. barth              6,691       19,748        .10             5.00       5.36
 4. ford1             18,728       41,424        .30             4.57       4.69
 5. ken13             28,632       66,486       3.61              .92        .94
 6. barth5            15,606       45,878        .28             4.96       4.97
 7. shuttle_eddy      10,429       46,585        .09             9.44       9.33
 8. bcsstk18          11,948       68,571        .44             4.59       4.89
 9. bcsstk16           4,884      142,747        .16             8.19       1.74
10. bcsstk23           3,134       21,022        .22             4.32       4.34
11. bcsstk15           3,948       56,934        .22             4.77       4.62
12. bcsstk17          10,974      208,838        .30             5.97       2.33
13. pwt               36,519      144,794        .58             6.16       6.32
14. ford2            100,196      222,246       2.44             3.84       3.90
15. bcsstk30          28,924    1,007,284        .95             5.79       1.67
16. tandem_vtx        18,454      117,448        .85             4.11       4.13
17. pds10             16,558       66,550     107.81             1.24       1.16
18. copter1           17,222       96,921        .67             6.22       6.52
19. bcsstk31          35,588      572,914       1.50             4.83       2.58
20. nasasrb           54,870    1,311,227       2.06             6.14       2.44
21. skirt             45,361    1,268,228       2.03             6.38       1.72
22. tandem_dual       94,069      183,212       4.50             3.70       3.67
23. onera_dual        85,567      166,817       4.23             3.65       3.69
24. copter2           55,476      352,238       3.96             4.57       4.70

geometric mean                                                   4.61       3.53
median                                                           4.90       4.24



We refer the reader to Table 2 for relative quality of orderings and execution
times. As with the previous table, the data represents the average of 11 runs
with different seeds in the random number generator. The relative improvement
in the quality of the orderings over MMD is comparable with the improvements
reported by other authors, even though the test sets are not identical.

We have successfully compiled and used our code on Sun Solaris workstations
using both SunPRO C++ version 4.2 and GNU C++ version 2.8.1.1. The code
does not work on older versions of the same compilers. We have also compiled
our code on Windows NT using Microsoft Visual C++ 5.0.
5 Conclusions

Contrary to popular belief, our implementation shows that the most expensive 
part of these minimum priority algorithms is not the degree computation, but the 
quotient graph update. With all other implementations — including GENMMD 
and AMD — the degree computation is tightly coupled with the quotient graph 
update, making it impossible to separate the costs of degree computation from 
graph update with any of the earlier procedural implementations. The priority 
computation (for minimum degree) involves traversing the adjacency set of each 
reachable supernode after updating the graph. Updating the graph, however, 
involves updating the adjacency sets of each supernode and enode adjacent to 
each reachable supernode. This update process often requires several passes. 

By insisting on a flexible, extensible framework, we required more decoupling 
between the priority computation and graph update: between algorithm and data 
structure. In some cases, we had to increase the coupling between key classes 
to improve performance. We are generally satisfied with the performance of our 
code and with the value added by providing implementations of the full gamut
of state-of-the-art algorithms. We will make the software publicly available.
Abstract. High-performance scientific computing relies increasingly on
high-level, large-scale, object-oriented software frameworks to manage 
both algorithmic complexity and the complexities of parallelism: dis- 
tributed data management, process management, inter-process commu- 
nication, and load balancing. This encapsulation of data management, 
together with the prescribed semantics of a typical fundamental compo- 
nent of such object-oriented frameworks — a parallel or serial array class 
library — provides an opportunity for increasingly sophisticated compile- 
time optimization techniques. This paper describes two optimizing trans- 
formations suitable for certain classes of numerical algorithms, one for re- 
ducing the cost of inter-processor communication, and one for improving 
cache utilization; demonstrates and analyzes the resulting performance 
gains; and indicates how these transformations are being automated. 



1 Introduction 

Current ambitions and future plans for scientific applications, in part stimulated
by the Accelerated Strategic Computing Initiative (ASCI), practically mandate
the use of higher-level approaches to software development, particularly more
powerful organizational and programming tools and paradigms for managing
algorithmic complexity, making parallelism largely transparent, and more recently,
implementing methods for code optimization that could not be reasonably
expected of a conventional compiler.

An increasingly popular approach is the use of C++ object-oriented software
frameworks or hierarchies of extensible libraries. The use of such frameworks has
greatly simplified (in fact, made practicable) the development of complex serial
and parallel scientific applications at Los Alamos National Laboratory (LANL)
and elsewhere. Examples from LANL include Overture [1] and POOMA [2].

Concerns about performance, particularly relative to FORTRAN 77, are the
single greatest impediment to widespread acceptance of such frameworks, and
our (and others’) ultimate goal is to produce FORTRAN 77 performance (or
better, in a sense described later) from the computationally intensive
components of such C++ frameworks, namely their underlying array classes. There
are three broad areas where potential performance, relative to theoretical
machine capabilities, is lost: language implementation issues (which we address for
C++ elsewhere), communication, and, with the trend toward ever-deeper
memory hierarchies and the widening differences in processor and main-memory
bandwidth, poor cache utilization.

Experience demonstrates that optimization of array class implementations 
themselves is not enough to achieve desired performance; rather, their use must 
also be optimized. One approach, championed by the POOMA project (and
others), is the use of expression templates. Another, being pursued by us, is
the use of an optimizing preprocessor.

We present optimizing transformations applicable to stencil or stencil-like 
operations which can impose the dominant computational cost of numerical al- 
gorithms for solving PDEs. The first is a parallel optimization which hides com- 
munication latency. The second is a serial optimization which greatly improves 
cache utilization. These optimizations dovetail in that the first is required for 
the second to be of value in the parallel case. Last is an outline of an ongoing 
effort to automate these (and other) transformations in the context of parallel 
object-oriented scientific frameworks. 



2 Array Classes 

In scientific computing arrays are the fundamental data structure, and as such
compilers attempt a large number of optimizations for their manipulation. For
the same reason, array class libraries are ubiquitous fundamental components of
object-oriented frameworks. Examples include A++/P++ in Overture, valarray
in the C++ standard library, Template Numerical ToolKit (TNT), the
GNU Scientific Software Library (GNUSSL), and an unnamed component of
POOMA.

The target of transformation is the A++/P++ array class library, which
provides both serial and parallel array implementations. Transformation (and
distribution) of A++/P++ array statements is practicable because, by design,
they have no hidden or implicit loop dependence. Indeed, it is common to design
array classes so that optimization is reasonably straightforward; this is clearly
stated, for example, for valarray in the C++ standard library. Such statements
vectorize well, but our focus is on cache-based architectures because they are
increasingly common in both large- and small-scale parallel machines.

An example of an A++/P++ array statement implementing a stencil
operation is

for (int n=0; n != N; n++) // Outer iteration
    A(I,J) = (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1)) * 0.25;

The statement may represent either a serial or parallel implementation of Jacobi
relaxation. In the parallel case the array data represented by A is distributed in
some way across multiple processors and communication (to update the ghost
boundary points along the edges of the partitioned data) is performed by the “=”
operator. The syntax indicates that A denotes an array object of at least two
dimensions, and I and J denote either one-dimensional index or range objects (sets
or intervals, respectively, of indexes). Thus the loops over I and J are implicit,
as is distribution and communication in the parallel case.

In this case the array is two-dimensional, with the first dimension ranging
from 0 to SIZE_Y - 1, and the second ranging from 0 to SIZE_X - 1; and I and J
denote range objects with indexes 1 through SIZE_Y - 2 inclusive and 1 through
SIZE_X - 2 inclusive, respectively.

The equivalent (serial) C code is 

for (int n=0; n!=N; n++) {                  // Outer iteration
    for (int j=1; j!=SIZE_Y-1; j++)         // calculate new solution
        for (int i=1; i!=SIZE_X-1; i++)
            a_new[j][i] = (a[j][i-1] + a[j][i+1] +
                           a[j-1][i] + a[j+1][i]) * 0.25;
    for (int j=1; j!=SIZE_Y-1; j++)         // copy new to old
        for (int i=1; i!=SIZE_X-1; i++)
            a[j][i] = a_new[j][i];
}



3 Reducing Communication Overhead 

Tests on a variety of multiprocessor configurations, including networks of
workstations, shared memory, DSM, and distributed memory, show that the cost (in
time) of passing a message of size N, cache effects aside, is accurately modeled
by the function L + CN, where L is a constant per-message latency, and
C is a cost per word. This suggests that message aggregation (lumping several
messages into one) can improve performance.¹
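The effect of aggregation under the L + CN model can be made concrete with a small calculation. The function names separateCost and aggregatedCost and the numbers in the usage note are illustrative assumptions, not measurements from this paper.

```cpp
// Cost model from the text: one message of N words costs L + C*N.

// Total cost of sending `messages` separate messages of `words` words each.
double separateCost(double L, double C, int messages, int words) {
    return messages * (L + C * words);
}

// Cost of a single aggregated message carrying the same total data.
double aggregatedCost(double L, double C, int messages, int words) {
    return L + C * messages * words;
}
```

With, say, L = 100 and C = 1 in arbitrary time units, three messages of 50 words cost 450 when sent separately and 250 when aggregated. The saving is (k - 1)L for k messages, so aggregation pays off most when the latency term dominates.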

In the context of stencil-like operations, message aggregation may be achieved
by widening the ghost cell widths. In detail, if the ghost cell width is
increased to three, using A and B as defined before, A[0..99,0..52] resides on
the first processor and A[0..99,48..99] on the second. To preserve the
semantics of the stencil operation the second index on the first processor is
1 to 51 on the first pass, 1 to 50 on the second pass, and 1 to 49 on the third
pass, and similarly on the second processor. Following three passes, three
columns of A on the first processor must be updated from the second, and vice
versa. This pattern of access is diagrammed in Fig. 1.
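The following C sketch illustrates the idea on a simpler case than the one in
the text: a one-dimensional array of 100 points with a three-point averaging
stencil, split between two simulated "processors" (plain local arrays; no real
message passing is involved). The array size, the stencil, the initial data and
all names are assumptions made for illustration. Each side sweeps a range that
shrinks by one point per pass and exchanges w ghost values once every w passes;
the result is then compared with a serial computation.

```c
#include <assert.h>
#include <string.h>

enum { N = 100, HALF = N / 2, MAXW = 8 };

/* One Jacobi pass of a 3-point stencil over a[lo..hi]; a[lo-1] and a[hi+1]
   are read but not written. `len` is the number of elements held in a. */
static void sweep(double *a, int len, int lo, int hi)
{
    double old[N];
    memcpy(old, a, (size_t)len * sizeof *a);
    for (int i = lo; i <= hi; i++)
        a[i] = 0.5 * (old[i - 1] + old[i + 1]);
}

/* Returns 1 if `rounds` rounds of w local passes, each round followed by a
   single exchange of w ghost values, reproduce rounds*w serial passes. */
int aggregated_matches_serial(int w, int rounds)
{
    assert(w >= 1 && w <= MAXW && rounds >= 0);
    int len = HALF + w;                      /* owned points plus ghosts */
    double g[N], left[HALF + MAXW], right[HALF + MAXW];

    for (int i = 0; i < N; i++)
        g[i] = (double)((i * i) % 17);       /* arbitrary initial data */
    memcpy(left, g, (size_t)len * sizeof *g);              /* 0 .. HALF+w-1 */
    memcpy(right, g + HALF - w, (size_t)len * sizeof *g);  /* HALF-w .. N-1 */

    for (int r = 0; r < rounds; r++) {
        for (int p = 1; p <= w; p++) {
            sweep(g, N, 1, N - 2);               /* serial reference */
            sweep(left, len, 1, len - 1 - p);    /* upper bound shrinks */
            sweep(right, len, p, len - 2);       /* lower bound grows */
        }
        /* One aggregated "message" of w values in each direction. */
        memcpy(left + HALF, right + w, (size_t)w * sizeof *g);
        memcpy(right, left + HALF - w, (size_t)w * sizeof *g);
    }
    for (int i = 0; i < HALF; i++)
        if (left[i] != g[i] || right[w + i] != g[HALF + i])
            return 0;
    return 1;
}
```

With w = 3 the left-hand side sweeps local indexes 1 to 51, then 1 to 50, then
1 to 49, the same ranges as quoted above for the first processor.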

Clearly there is a tradeoff of computation for communication overhead. In
real-world applications the arrays are often numerous but small, with
communication time exceeding computation time, and the constant time L of a
message exceeding the linear time CN. Experimental results for a range of
problem sizes and numbers of processors are given in Figs. 2 and 3.

^ Cache effects are important but are ignored in such simple models.

Fig. 1. Pattern of access and message passing for ghost boundary width three



Additional gains may be obtained by using asynchronous (non-blocking) message
passing, which allows computation to overlap communication. Here the
computation involving the ghost boundaries and adjacent columns is performed
first, communication is initiated, and then the interior calculations are
performed. Widening the ghost boundaries, and so allowing multiple passes over
the arrays without communication, decreases the ratio of communication time to
computation time; when the ratio is reduced to one or less, communication time
is almost entirely hidden.

4 Temporal Locality, Cache Reuse, and Cache Blocking 

Temporal locality refers to the closeness in time, measured in the number of 
intervening memory references, between a given pair of memory references. Of 
concern is the temporal locality of references to the same memory location — if 
sufficiently local the second reference will be resolved by accessing cache rather 
than main or non-local shared memory. 

A cache miss is compulsory when it results from the first reference to a
particular memory location; no ordering of memory references can eliminate a
compulsory miss. A capacity miss occurs when a subsequent reference is not
resolved by the cache, presumably because the data has been flushed from cache
by intervening memory references. Thus the nominal goal in maximizing cache
utilization is to reduce or eliminate capacity misses. We do not address the
issue of conflict misses: given that the cache is associative, and allowing a
small percentage of the cache to remain apparently free when performing cache
blocking, their impact has proven unimportant. For architectures where their
impact is significant various solutions exist cm.

Fig. 2. Message aggregation: improvement as a function of problem size and
ghost cell width



Fig. 3. Message aggregation: improvement 
as a function of number of processors and 
ghost cell width 



To give an example of the relative speeds of the various levels of the memory
hierarchy, on the Origin 2000 (the machine for which we present performance
data) the cost of accessing L1 cache is one clock cycle; 10 clock cycles for
L2, 80 clock cycles for main memory, and for non-local memory 120 clock cycles
plus network and cache-coherency overhead.
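To see how these latencies combine, the sketch below charges each memory
reference the cost of the level that satisfies it, using the cycle counts
quoted above for L1, L2 and local main memory. This is a deliberately crude
model (non-local memory, overlapping misses and prefetching are all ignored),
and the function name and the reference counts in the example are hypothetical.

```c
#include <assert.h>

/* Cycle costs quoted in the text for the Origin 2000. */
enum { L1_CYCLES = 1, L2_CYCLES = 10, MEM_CYCLES = 80 };

/* Total cycles when each reference pays for the level that resolves it. */
long access_cycles(long l1_hits, long l2_hits, long memory_refs)
{
    assert(l1_hits >= 0 && l2_hits >= 0 && memory_refs >= 0);
    return l1_hits * L1_CYCLES + l2_hits * L2_CYCLES
         + memory_refs * MEM_CYCLES;
}
```

For instance, 100 references of which 90 hit in L1, 8 hit in L2 and 2 go to
main memory cost 90 + 80 + 160 = 330 cycles under this model, against 100
cycles if all of them hit in L1.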

A problem with loops that multiply traverse an array (as in the given code
fragment) is that when the array is larger than the cache, the data cycles
repeatedly through the cache. This is common in numerical applications, and
stencil operations in particular. Cache blocking seeks to increase temporal
locality by re-ordering references to array elements so that small blocks that
fit into cache undergo multiple traversals without intervening references to
other parts of the array.
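The cycling effect can be reproduced with a toy cache model. The C sketch below
simulates a fully associative cache with least-recently-used replacement,
counted in whole cache lines, and classifies the misses produced by repeated
sweeps over an array as compulsory or capacity misses. The cache organisation,
the replacement policy and all names are assumptions made for illustration;
they do not describe the Origin 2000 caches.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    long compulsory;   /* first reference to a line */
    long capacity;     /* line was loaded before but has been evicted */
} MissCount;

/* Sweep `sweeps` times over an array of n_lines cache lines using a fully
   associative LRU cache that holds cache_lines lines. */
MissCount sweep_misses(int n_lines, int cache_lines, int sweeps)
{
    assert(n_lines > 0 && cache_lines > 0 && sweeps >= 0);
    MissCount m = {0, 0};
    long *last_use = calloc((size_t)n_lines, sizeof *last_use); /* 0: absent */
    char *seen = calloc((size_t)n_lines, 1);
    assert(last_use != NULL && seen != NULL);
    long now = 0;
    int resident = 0;

    for (int s = 0; s < sweeps; s++) {
        for (int line = 0; line < n_lines; line++) {
            now++;
            if (last_use[line] == 0) {                 /* miss */
                if (seen[line]) {
                    m.capacity++;
                } else {
                    m.compulsory++;
                    seen[line] = 1;
                }
                if (resident == cache_lines) {         /* evict the LRU line */
                    int victim = -1;
                    for (int k = 0; k < n_lines; k++)
                        if (last_use[k] != 0 &&
                            (victim < 0 || last_use[k] < last_use[victim]))
                            victim = k;
                    last_use[victim] = 0;
                } else {
                    resident++;
                }
            }
            last_use[line] = now;                      /* most recently used */
        }
    }
    free(last_use);
    free(seen);
    return m;
}
```

For an array of 8 lines and a cache of 4 lines, three sweeps give 8 compulsory
and 16 capacity misses: every reference misses. When the array fits (4 lines in
a 4-line cache), only the 4 compulsory misses remain.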

We distinguish two kinds of cache blocking: blocking done by a compiler
(also called tiling), which we will refer to as compiler blocking, and our more
effective technique, which we call temporal blocking, depicted in Fig. 4. In
the case of, e.g., stencil operations, a compiler will not perform the kinds of
optimizations we propose because of the dependence between outer iterations. A
compiler may still perform blocking, but to lesser effect. For both, the
context in which the transformation may be applied is in sweeping over an
array, typically in a simple regular pattern of access visiting each element
using a stencil operator. Such operations are a common part of numerical
applications, including more sophisticated numerical algorithms (e.g. multigrid
methods). What we describe is independent of any particular stencil operator,
though the technique becomes more complex for higher-order operators because of
increased stencil radius. Temporal blocking is also applicable to a loop over a
sequence of statements.

Fig. 4. Pattern of access for compiler blocking versus temporal blocking

5 The Temporal Blocking Algorithm 

The basic idea behind the algorithm is that of applying a stencil operator to 
an array, in place, generalized to multiple applications of the stencil (iterations 
over the array) in such a way that only one traversal is required in each of one 
or more dimensions. 

Consider first a stencil operator f(x,y,z) applied to a 1D array A[0..N].
Ignoring treatment at the ends, the body of the loop, for loop index variable
i, is



t = A[i];
A[i] = f( u, A[i], A[i+1] );
u = t;

Here t and u are the temporaries that serve the role of the array of initial data 
(or the previous iteration’s values) for an algorithm that does not work in place. 
Next we generalize to n iterations. For three iterations the code is 

for (j=2; j!=-1; j--) {
    t[j] = A[i+j];
    A[i+j] = f( u[j], A[i+j], A[i+j+1] );
    u[j] = t[j];
}

First we observe that the ‘window’ into the array, here A[i..i+3], may be as
small as the stencil radius plus the number of iterations, as is the case here.
Second, at the cost of slightly greater algorithmic complexity and a small cost
in time, space can be saved: only one temporary array is required, of length
one greater than the minimum size of the window, rather than twice the minimum
size of the window.

Fig. 5. Stencil operation and temporary storage for 1D decomposition of a 2D
problem



It is this window into the array that we wish to be cache-resident. It may be
any size greater than the minimum (the temporary storage requirements do not
change); for our performance experiments the various arrays are sized so that
they nearly fill the L1 cache.
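A complete version of the multi-iteration loop body given earlier in this
section, including one possible treatment of the ends, might look like the
following C sketch. The stencil f, the fixed end points A[0] and A[n-1], the
bounds MAXK and MAXN, and all function names are assumptions made for
illustration, and a scalar t is used where the fragment uses the array t[j]. At
outer step i, level j advances element i+j by one more iteration, so after a
single left-to-right traversal every interior element has received K
applications of the stencil.

```c
#include <assert.h>
#include <string.h>

enum { MAXK = 16, MAXN = 1024 };

/* Example 3-point stencil (an assumption; any f(x, y, z) would do). */
static double f(double x, double y, double z)
{
    return 0.25 * x + 0.5 * y + 0.25 * z;
}

/* Apply the stencil K times, in place, in a single traversal of A[0..n-1].
   A[0] and A[n-1] are treated as fixed boundary values. */
void temporal_sweep(double *A, int n, int K)
{
    assert(n >= 3 && K >= 1 && K <= MAXK);
    double u[MAXK];
    for (int j = 0; j < K; j++)
        u[j] = A[0];          /* left neighbour of element 1 at every level */

    for (int i = 2 - K; i <= n - 2; i++) {
        for (int j = K - 1; j >= 0; j--) {
            int pos = i + j;
            if (pos < 1 || pos > n - 2)
                continue;     /* this level has not started, or has finished */
            double t = A[pos];
            A[pos] = f(u[j], A[pos], A[pos + 1]);
            u[j] = t;         /* value of A[pos] before this level's update */
        }
    }
}

/* Reference: K ordinary sweeps, each reading a copy of the previous iterate. */
void reference_sweeps(double *A, int n, int K)
{
    assert(n >= 3 && n <= MAXN && K >= 1);
    double old[MAXN];
    for (int k = 0; k < K; k++) {
        memcpy(old, A, (size_t)n * sizeof *A);
        for (int i = 1; i <= n - 2; i++)
            A[i] = f(old[i - 1], old[i], old[i + 1]);
    }
}
```

Calling temporal_sweep and reference_sweeps on copies of the same data with the
same K should leave identical arrays; the first touches each element's
neighbourhood once, while the second traverses the whole array K times.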

Another observation is that given an n-dimensional problem with an
m-dimensional decomposition, this technique may be applied with respect to any
subset of the m dimensions of the decomposition, with the more common and more
simply coded multiple-traversal and/or old-new approximations approach applied
to the remaining dimensions. The goal is for all applications of the stencil
operator to any given data element to incur a single cache miss (the compulsory
one) for that element, which indicates that for larger problem sizes (relative
to cache size) the technique must be applied with respect to a larger number of
dimensions. Figure 5 depicts the stencil operation and temporary storage for a
1D decomposition of a 2D problem.

6 Performance Analysis 

It is possible to predict the number of cache misses generated by the Jacobi
relaxation code. In the case of compiler blocking the first sweep through the
array should generate a number of cache misses equal to the number of elements
accessed; these are compulsory. Each subsequent sweep will generate the same
number of capacity misses. For temporal blocking only the compulsory misses
should be generated. Experimental results shown in Fig. 6 bear this out. The
data were collected on a MIPS R10000-based Silicon Graphics Inc. Origin 2000
with 32K of primary data cache and 4M of secondary unified cache, using
on-processor hardware performance monitor counters; the programs were compiled
at optimization level 3 using the MIPSpro C++ compiler. Fig. 7 contrasts the
performance of compiler blocking and temporal blocking in terms of CPU cycles.
The block size as well as the number of Jacobi iterations varies along the
x-axis.









F. Bassetti, K. Davis, and D. Quinlan 



Fig. 6. L1 misses as a function of block size/number of iterations (problem
size: 1M points; x-axis: block size - iterations)

Fig. 7. CPU cycles as a function of block size/number of iterations (problem
size: 1M points)



Figure 7 shows that the temporal blocking version is twice as fast as the
compiler blocking version until the block size exceeds the size of primary
cache, beyond which temporal blocking and compiler blocking generate a similar
number of cache misses. As expected, temporal blocking yields an improvement
in performance linear in the number of iterations so long as the various data
associated with a particular block fit in cache. Figure 8 shows that there is
an ideal block size (relative to cache size): in terms of CPU cycles, blocks
smaller than ideal suffer from the constant overhead associated with the sweep
of a single block; blocks larger than ideal generate capacity misses. (The
spikes are attributable to anomalies of the hardware counters, which do not
capture results absolutely free of errors.)

The presence of multiple cache levels requires no special consideration: other
experiments show optimization with respect to L1 cache is all that is required,
and optimization with respect to L2 only is not nearly as beneficial.

For problem sizes exceeding the size of L2 (usually the case for meaningful
problems), a straightforward implementation gives rise to a number of cache
misses proportional to the number of iterations; with our transformation the
number of misses is effectively constant for a moderate number of iterations.

In all the experiments for which results are given the block size is the same
as the number of iterations. In this implementation of temporal blocking the
number of possible iterations is limited by the block size. However, in most
cases the number of iterations is dictated by the numerical algorithm. The
choice then becomes that of the best block size given a fixed number of
iterations. For a Jacobi problem a good rule of thumb is to have the number of
elements in the transitory array be a small fraction of the number of elements
that could fit in a block that fits in primary cache.

Fig. 8. CPU cycles as a function of block size for fixed number of iterations

Fig. 9. Relative predicted performance assuming ideal cache behaviour



The figures show, and Fig. 9 makes clear, that the achieved performance
improvement is not as good as predicted: performance improves with the number
of iterations, but never exceeds a factor of two. The figure shows that the
achieved miss behavior is not relatively constant, but depends on the number of
iterations. The test code currently ignores the fact that permanent residency
in cache for the transitory array cannot be guaranteed just by ensuring that
there is always enough cache space for all the subsets of the arrays. Different
subsets of the other arrays can map to the same locations in cache as does the
transitory array, resulting in a significant increase in the number of conflict
misses. This is a recognized problem with cache behavior; a solution is
suggested in PH. Various approaches are being evaluated under the assumption
that there is still room for improvement before reaching physical,
architecture-dependent limitations.

6.1 Message Aggregation 

The message aggregation optimization is relevant in the parallel context. It
is important to point out that it would not be possible to enable temporal
blocking without message aggregation. Thus, from a performance point of view,
the first expectation is not to lose performance when using message
aggregation. At this stage of the investigation we believe that temporal
blocking as a stand-alone optimization has a greater impact on overall
performance than message aggregation alone. The reasons in support of this are
also affected by implementation issues. Currently, the two optimizations are
implemented and tested separately. While the results collected show clearly how
performance is improved by temporal blocking, the performance improvement from
message aggregation may appear a little obscure at first glance.

Looking at the performance data obtained for message aggregation we first
observe that there is an overall improvement in performance when the problem
size is fairly large (i.e. doesn’t fit in cache once decomposed among the
processors), as shown in Fig. 3. The data in the figures were collected using
an SGI Origin 2000 with 64 processors, each with a secondary cache of 4 MB. In
all the runs both processors available on each node were used. The measurements
were taken using the machine in a shared mode, and are therefore affected by
other users’ activities.

Performance data in Fig. 3 show improvement when the width of the boundaries
is increased, reducing in this way the amount of communication needed. The
chart shows improvement when aggregation is used, but the trend is not clean.
First of all we have to factor in temporal blocking. Message aggregation
introduces redundant computation, and the lack of an efficient way of doing
the computation once the boundaries have been exchanged has an impact. In
particular, without a caching strategy, when boundaries are large enough their
aggregation might worsen cache behavior, since a larger quantity of data needs
to be accommodated in a local processor’s cache. This translates into a
reduction of the overall improvement. The results presented in Fig. 2 support
this intuition. A small problem size that always fits in cache has been used,
varying the size of the boundaries, but on just two processors (similar results
can be obtained for a larger number of processors, with the only difference
being the problem size). The performance data show larger improvements with a
clearer pattern. With these data the potential of the message aggregation
optimization is clearer. When computation and communication are of roughly the
same order this technique reduces the communication overhead, which translates
into a bigger potential for hiding the communication costs.

In this work we have only presented data obtained using an SGI Origin 2000
system. Origin systems have a very good network topology as well as very good
latency and bandwidth. In particular, for a neighbor-to-neighbor communication
pattern Origin systems perform particularly well even without message
aggregation. In the test codes used the communication costs are negligible when
the processors involved in the exchange are in the same node. It is clear that
when data need to be exchanged with more processors the proposed optimization
will have a greater impact. Preliminary results obtained on a networked
collection of Sun workstations support this claim.



7 Automating the Transformation 

An optimizing transformation is generally only of academic interest if it is
not deployed and used. In the context of array classes, it does not appear
possible to provide this sort of optimization within the library itself because
the applicability of the optimization is context dependent: the library can’t
know how its objects are being used. Two mechanisms for automating such
optimizations are being actively developed: the use of expression templates
(e.g. in POOMA), which seems too limited; and a source-to-source
transformational system (a preprocessor), which we are currently developing.

The ROSE preprocessor is a mechanism for (C++) source-to-source
transformation, specifically targeted at optimizing the use of statements that
manipulate array class objects. It is based on the Sage II C++ source code
restructuring tools and provides a distinct (and optional) step in the
compilation process. It recognizes the use of the A++/P++ array class objects,
and is ‘hard-wired’ with (later parameterized by) the A++/P++ array class
semantics, so obviating the need for difficult or impossible program analysis.
It is also parameterized by platform properties such as cache size. There is in
principle no limit (within the bounds of computability) on the types of
transformations that can be performed using this mechanism.



8 Conclusions 

Previous work has focused on the optimization of the array class libraries
themselves, and the use of techniques such as expression templates to provide better
performance than the usual overloaded binary operators. We posit that such 
approaches are inadequate, that desirable optimizations exist that cannot be 
implemented by such methods, and that such approaches cannot reasonably be 
expected to be implemented by a compiler. One such optimization for cache 
architectures has been detailed and demonstrated. 

A significant part of the utility of this transformation is in its use to optimize 
array class statements (a particularly simple syntax for the user which hides the 
parallelism, distribution, and communication issues) and in the delivery of the 
transformation through the use of a preprocessing mechanism. 

The specific transformation we introduce addresses the use of array statements
or collections of array statements within loop structures; thus it is really
a family of transformations. For simplicity, only the case of a single array in
a single loop has been described. Specifically, we evaluate the case of a
stencil operation in a for loop. We examine the performance using the C++
compiler, but generate only C code in the transformation. We demonstrate that
the temporal blocking transform is two times faster than the standard
implementation.

The temporal blocking transformation is language independent, although we 
provide no mechanism to automate the transformation outside of the Overture 
object-oriented framework. The general approach could equally well be used with 
FORTRAN 90 array syntax. 

Finally, the use of object-oriented frameworks is a powerful tool, but limited
in use by performance being less than that of FORTRAN 77; we expect work such
as this to change this situation, such that in the future one will use such
object-oriented frameworks because they represent both a higher-level, simpler,
and more productive way to develop large-scale applications and a higher
performance development strategy. We expect higher performance because the
representation of the application using the higher level abstractions permits
the use of new tools (such as the ROSE optimizing preprocessor) that can
introduce more sophisticated transformations (because of their more restricted
semantics) than compilers could introduce (because of the broader semantics
that the complete language represents).



Abstract. This paper presents a Java-based software infrastructure that allows
the merging of Web-based metacomputing with cluster-based parallel
computing. We briefly compare the two approaches and we describe the
implementation of a software bridge that supports the execution of
meta-applications: some processes of the application may run over the Internet
as Java applets while other processes of the same application can execute on a
dedicated cluster of machines running PVM or MPI. We present some
performance results that show the effectiveness of our approach.



1. Introduction 

One decade ago the execution of parallel problems was dominated by the use of
supercomputers, vector computers, multiprocessor and shared-memory machines. In
this past decade there has been an increasing trend towards the use of Networks
of Workstations (NOWs) [1]. Another concept that has recently become popular is
Web-based metacomputing [2]. The idea is to use geographically distributed
computers to solve large parallel problems with the communication done through
the Internet. In practice, the idea behind Web-based parallel computing is just
a new variation on NOW-based computing: the recycling of idle CPU cycles in the
huge number of machines connected to the Internet.

Both approaches have their domain of application: cluster-based computing is
used to execute a parallel problem that is of interest to some institution or
company. The parallel problem is typically of medium scale and the network of
computers usually belongs to the same administrative domain. Cluster-based
computing is now well established and libraries like PVM [3] and MPI [4] have
been widely used by the HPC community. On the other hand, Web-based computing
is more suited for the execution of long-running applications that have a
global interest, like solving some problems of cryptography, mathematics and
computational science. It uses computing resources that belong to several
administrative domains. Web-based computing is not widely used, though there
are already some successful implementations of this concept. Examples include
the Legion project [5], Globus [6], Charlotte [7], and Javelin [8], among
others.

In this paper, we present a software infrastructure that allows the
simultaneous exploitation of both approaches. The system was developed in Java
and is part of the JET project [9]. Originally, the JET system provided a model
of execution based on Internet computing: the applications are executed by Java
applets that are downloaded through a standard Web browser by a user who wants
to volunteer some CPU cycles of his or her machine to a global computation.
Lately, we have included the necessary features to support the execution of
applications on clusters of computers running well-established communication
libraries like MPI and PVM. These two libraries have been enhanced with a Java
interface and we have developed a module that glues both models of computation.
This module is called JET-Bridge and will be described in this paper.

The rest of the paper is organized as follows. Section 2 presents the
functional architecture of the JET system. Section 3 describes in some detail
the JET-Bridge. Section 4 presents some performance results. Section 5
concludes the paper.



2. A General Overview of the JET Project 

JET is a software infrastructure that supports parallel processing of
CPU-intensive problems that can be programmed in the Master/Worker paradigm.
There is a Master process that is responsible for the decomposition of the
problem into small and independent tasks. The tasks are distributed among the
Worker processes, which execute a quite simple cycle: receive a task, compute
it and send the result back to the Master. The Master is responsible for
gathering the partial results and merging them into the problem solution. Since
each task is independent, there is no need for communication between worker
processes.
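The cycle described above can be sketched serially in C as follows, using the
N-Queens counting problem as the workload. The decomposition into one task per
position of the first-row queen, and every name in the code, are assumptions
made for illustration; this is not the JET API, and real Workers would run
concurrently on remote machines rather than in a loop.

```c
#include <assert.h>

/* Count the completions of a partial placement (bitmask backtracking). */
static long count_from(int n, int row, unsigned cols, unsigned d1, unsigned d2)
{
    if (row == n)
        return 1;
    unsigned all = (1u << n) - 1u;
    unsigned avail = all & ~(cols | d1 | d2);
    long total = 0;
    while (avail) {
        unsigned bit = avail & (0u - avail);    /* lowest free column */
        avail ^= bit;
        total += count_from(n, row + 1, cols | bit,
                            ((d1 | bit) << 1) & all, (d2 | bit) >> 1);
    }
    return total;
}

/* Worker: compute one independent task (first-row queen in column `task`). */
long worker_compute(int n, int task)
{
    assert(n >= 1 && n <= 16 && task >= 0 && task < n);
    unsigned bit = 1u << task;
    unsigned all = (1u << n) - 1u;
    return count_from(n, 1, bit, (bit << 1) & all, bit >> 1);
}

/* Master: decompose into n tasks, hand them out, merge the partial results. */
long master_solve(int n)
{
    assert(n >= 1 && n <= 16);
    long solutions = 0;
    for (int task = 0; task < n; task++)    /* stands in for remote Workers */
        solutions += worker_compute(n, task);
    return solutions;
}
```

Because the tasks share no state, the order in which the partial counts arrive
does not matter; master_solve(8) returns 92, the known number of solutions for
eight queens.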

The Worker processes execute as Java applets inside a Web browser. A user who
wants to volunteer spare CPU cycles to a JET computation just needs to access a
Web page by using a Java-enabled browser. Then, the user just has to click
somewhere inside the page and one Worker Applet is downloaded to the client
machine. This Applet will communicate with a JET Master that executes on the
same remote machine where the Web page came from.

The communication between the worker applets and the JET Master is done through
UDP sockets. This class of sockets provides higher scalability and consumes
fewer resources than TCP sockets. The UDP protocol does not guarantee the
delivery of messages, but the communication layer of JET implements a reliable
service that ensures sequenced and error-free message delivery. The library
keeps a time-out mechanism for every socket connection in order to detect the
failure or withdrawal of a worker applet.
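One common way to obtain sequenced delivery over an unreliable transport is to
number the messages and have the receiver accept only the next expected number,
combined with a per-connection timer for detecting silent peers. The C sketch
below shows that bookkeeping only. It is an assumption about how such a layer
could work, not a description of the JET library: the type and function names
are hypothetical, sequence-number wrap-around is ignored, and no sockets are
involved.

```c
#include <assert.h>

typedef enum { DELIVER, DUPLICATE, OUT_OF_ORDER } Verdict;

typedef struct {
    unsigned next_seq;   /* sequence number expected next */
    long last_heard;     /* time of the most recent packet from this peer */
} Peer;

/* Decide what to do with an incoming packet numbered `seq` at time `now`. */
Verdict on_packet(Peer *p, unsigned seq, long now)
{
    assert(p != 0);
    p->last_heard = now;          /* any packet shows the peer is alive */
    if (seq == p->next_seq) {
        p->next_seq++;
        return DELIVER;           /* in order: pass up and acknowledge */
    }
    if (seq < p->next_seq)
        return DUPLICATE;         /* already seen: acknowledge again, drop */
    return OUT_OF_ORDER;          /* gap: drop and wait for retransmission */
}

/* Nonzero if nothing has been heard from the peer for more than `timeout`. */
int peer_timed_out(const Peer *p, long now, long timeout)
{
    assert(p != 0 && timeout >= 0);
    return now - p->last_heard > timeout;
}
```

A sender paired with this receiver would retransmit any message that is not
acknowledged in time, and a Master could treat a timed-out peer as a withdrawn
worker whose tasks need to be reassigned.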

The JET system provides some internal mechanisms to tolerate the high latency
of communication over the Internet. Those techniques are based on the
prefetching of tasks by the remote machines and the asynchronous flush of
output results back to the JET Master. There are some internal threads that
perform the communication concurrently with the normal execution of the
application processes.

The number of machines that can join a JET computation is unpredictable, but
the system should be able to manage hundreds or thousands of clients. To ensure
scalable execution we depart from the single-server approach, and the
forthcoming version of JET relies on a hierarchical structure of servers, as
represented in Figure 1.




Fig. 1. The Structure of the JET virtual machine. 

This scalable structure relies on multiple JET Masters: every Master will be
responsible for a subset of worker machines, dividing the load more evenly and
increasing the reliability of the system. Every JET Master communicates with a
centralized JET Server, which maintains the global status of the application
and a database with all the interesting statistics.

The JET system includes some fault-tolerance mechanisms. Task reconfiguration is
used to tolerate the loss of worker applets. The resiliency of the Master processes is
achieved through the use of checkpointing and logging techniques. The checkpointing 
mechanism is necessary to assure the continuity of the application when there is a 
failure or a preventive shutdown of a JET Master or the main Server. The critical state 
of the application is saved periodically in stable storage in some portable format that 
allows its resumption later on in the same or in a different machine. 

JET computations will not be restricted to the Web. It is possible to use other
existing high-performance computing resources, like clusters of workstations or
parallel machines. The basic idea is to allow existing clusters of machines
running PVM or MPI to inter-operate with a JET computation. To achieve this
interoperability we have implemented a Java interface to the Windows version of
MPI that was developed by our research group [10]. The Java binding is
described in [11]. We have also ported a Java interface [12] to our
implementation of PVM, called WPVM [13]. The next section presents the
JET-Bridge, a software module that allows the integration of JET with PVM/MPI
applications.



3. The JET-Bridge 

The functioning of the JET-Bridge assumes that the applications that will
execute on the cluster side elect one of the processes as the Master of the
cluster. Usually this is the process with rank 0. The Master process is the
only one that interacts with the JET-Bridge. Inside the cluster the application
may follow any programming paradigm (SPMD, Task-Farming or Pipelining),
although we have only used the JET-Bridge with Task-Farming applications.

The Master process of a PVM/MPI cluster needs to create an instance of an object 
(JetBridge) that implements a bridge between the cluster and the JET Master. This 
object is responsible for all the communication with the JET Master. The Master 
process from a PVM/MPI cluster gets some set of jobs from the JET Master, and 
maintains them in an internal buffer. These jobs are then distributed among the 
Workers of the cluster. Similarly, the results gathered by the PVM/MPI Master 
process are placed in a separate buffer and will be sent later to the JET Master. This 
scheme is represented in Figure 2. 

The Master is the only process of the cluster that connects directly with the JET 
machine. This process is the only one that needs to be written in Java. The Worker 
processes can be implemented in any of the languages supported by WMPI/WPVM 
libraries (i.e. C, Fortran, and Java) and all the heterogeneity is solved by using the 
Java bindings [11]. 

When the user creates an instance of the object JetBridge two new threads are
created: the Sender thread and the Receiver thread. These threads are
responsible for all the communication with the JET Master.

Fig. 2. Interoperability of JET with PVM/MPI clusters.

3.1 Registering to the JET Machine 

To connect to the JET machine the WPVM or WMPI application should call the
method jStart(). This method performs the registration with the JET machine.
The JET Server creates a service entry for that cluster and assigns an
identification key that is sent back to the WPVM or WMPI Master process
(rank 0). After this registration process the cluster is ready to participate
in a JET computation.

3.2 Obtaining Jobs to Compute 

To mask the latency of the network and exploit some execution autonomy the
JET-Bridge performs some job prefetching from the JET Server. A set of jobs is
grabbed from the server and placed in an internal pool of jobs residing on the
machine where the WPVM/WMPI Master process is running. Later on, these jobs are
distributed among the Worker processes of the cluster. The number of jobs that
are prefetched each time depends on the size of the cluster. When the
JET-Bridge detects that the number of jobs to compute is less than the number
of Workers it sends another request to the JET Server asking for a new batch of
jobs.
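The refill rule in this paragraph amounts to a simple threshold test. In the C
sketch below the rule itself (request more jobs when fewer jobs than Workers
remain) is taken from the text, while the batch size of a fixed multiple of
the number of Workers, the struct and the function name are assumptions made
for illustration.

```c
#include <assert.h>

typedef struct {
    int pooled_jobs;      /* jobs waiting in the internal pool */
    int workers;          /* Worker processes in the cluster */
    int jobs_per_worker;  /* assumed batch factor, not from the text */
} JobPool;

/* Number of jobs to ask the JET Server for now (0 means no request). */
int jobs_to_request(const JobPool *pool)
{
    assert(pool != 0 && pool->pooled_jobs >= 0);
    assert(pool->workers > 0 && pool->jobs_per_worker > 0);
    if (pool->pooled_jobs >= pool->workers)
        return 0;                          /* enough work for every Worker */
    return pool->workers * pool->jobs_per_worker;
}
```

With 8 Workers and an assumed factor of 2, a pool holding 3 jobs triggers a
request for 16 more, while a pool holding 8 or more jobs triggers none.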

3.3 Distribution of Jobs 

The Master of the cluster is responsible for the distribution of the jobs among
the rest of the application processes. The Master should make a call to the
method jGet() to take a job from the internal pool of jobs. All the jobs have a
unique identifier and this number is automatically piggybacked onto the
corresponding result packet.

3.4 Delivering of Results 

When the Master of the cluster wants to deliver a computed result to the JET
Server it should call the method jPut(). The results are stored internally in a
pool of results and are later delivered to the JET Master. In this way, a set
of results is merged into a single message to avoid intensive communication
with the JET Server.

3.5 Management of Clusters 

The JET Server treats an MPI/PVM cluster as a single Worker, although with some
differences. First, the communication is through TCP sockets, while in Web
computing JET uses UDP sockets. The message exchange is also done differently:
while a Worker Applet receives a job from the Master, computes it and sends
back the result, the proxy of the cluster (i.e. the Master) exchanges several
jobs at a time and the results are merged into a single message.

To manage the participating clusters that join a JET computation we introduced
two new threads in the JetMaster: a SenderTCP thread and a ReceiverTCP thread.
When the SenderTCP thread wants to send some jobs to a cluster worker, it
establishes a
TCP connection, sends the jobs and closes the connection. At the other side, when a 
cluster wants to deliver some results it also establishes a connection, sends the results 
and closes the connection. We do not maintain the connection with each cluster proxy 
to avoid the exhaustion of system resources. 

The JET-Bridge infrastructure is totally independent from the underlying 
application. In the future, if we want to port other well-known parallel libraries to 
Java, it will be easy to integrate them with the JET platform. 



4. Performance Results 

In this section we present some results of an experimental study that shows the 
effectiveness of the JET-Bridge together with the Java bindings that we have 
implemented. All the measurements were taken with the NQueens benchmark with 
14 queens on a cluster of 200 MHz Pentium machines running Windows NT 4.0, 
connected through a non-dedicated 10 Mbit/sec Ethernet. 

Figure 3 presents several different combinations of heterogeneous configurations 
of this benchmark using WMPI and the JET platform. JWMPI corresponds to the case 
where the application was written in Java and used the WMPI library. In CWMPI the 
application was written in C, while JWMPI (Native) represents a hybrid case: the 
interface to WMPI was written in Java but the real computation was done by a 
native method written in C. 



The first three experiments present the results of computations using homogeneous 
clusters of 8 processes. As we can see from the figure, the two computations that used 
native code (i.e. C) executed in about half the time of the pure Java version. This is due 
to the difference in performance between the two languages. Until Java compiler 
technology reaches maturity, the use of native code in Java programs is a possible way 
to improve performance. Providing access to standard libraries, often required in 
scientific programming, seems imperative in order to allow the reuse of existing code 
that was developed with MPI and PVM. 




Fig. 3. Performance results of heterogeneous configurations using WMPI. 

In our implementation of the NQueens benchmark the jobs are distributed on 
demand, allowing the faster workers to compute more jobs than the slower ones. The 
best performance is therefore obtained in all the computations that include processes entirely 
written in C or hybrid processes (Java+C) that use the native version of the 
kernel. 

More than the absolute results, this experiment has shown the importance of the 
JET-Bridge and the Java bindings, which allow us to exploit the potential of a truly 
heterogeneous computation. While some processes execute as Java Applets, 
others may execute in Java and use the WMPI library. They can also interoperate in 
the same application with other processes written in C or that use a hybrid approach: 
Java and native code. 

5. Conclusions 

The system presented in this paper provides an integrated solution to unleash the 
potential of diverse computational resources, different languages and different 
execution models. With a software module like the JET-Bridge it is possible to 
execute applications where part of the tasks are executed over the Internet while 
other tasks are computed on a dedicated PVM or MPI cluster-based platform. 
Moreover, some of the processes of the application may be executed as Java Applets, 
other processes may be written in C, and others may be written in Java and use native or pure 
Java code. The cluster processes can also choose between the MPI and the PVM API. 
Meta-applications could exploit this potential for 
interoperability. 

The JET-Bridge was implemented with socket communication and made use of the 
JNI interface. The interoperability between the different languages could have been 
developed with CORBA, although some decrease in 
performance should be expected. 

Abstract. Metacomputing frameworks have received renewed attention 
of late, fueled both by advances in hardware and networking, and by 
novel concepts such as computational grids. However, these frameworks 
are often inflexible, and force the application into a fixed environment 
rather than trying to adapt to the application’s needs. Harness is an experimental 
metacomputing system based upon the principle of dynamic 
reconfigurability not only in terms of the computers and networks that 
comprise the virtual machine, but also in the capabilities of the VM itself. 
These characteristics may be modified under user control via a “plug- 
in” mechanism that is the central feature of the system. In this paper 
we describe how the design of the Harness system allows the dynamic 
configuration and reconfiguration of virtual machines, including naming 
and addressing methods, as well as plug-in location, loading, validation, 
and synchronization methods. 



1 Introduction 

Harness is an experimental metacomputing system based upon the principle of 
dynamically reconfigurable networked computing frameworks. Harness supports 
reconfiguration not only in terms of the computers and networks that comprise 
the virtual machine, but also in the capabilities of the VM itself. These charac- 
teristics may be modified under user control via a “plug-in” mechanism that is 
the central feature of the system. The motivation for a plugin-based approach to 
reconfigurable virtual machines is derived from two observations. First, distribu- 
ted and cluster computing technologies change often in response to new machine 
capabilities, interconnection network types, protocols, and application require- 
ments. For example, the availability of Myrinet [[IJ] interfaces and Illinois Fast 
Messages has recently led to new models for closely coupled Network Of Work- 
stations (NOW) computing systems. Similarly, multicast protocols and better 
algorithms for video and audio codecs have led to a number of projects that fo- 
cus on tele-presence over distributed systems. In these instances, the underlying 
middleware either needs to be changed or re-constructed, thereby increasing the 
effort level involved and hampering interoperability. A virtual machine model 
intrinsically incorporating reconfiguration capabilities will address these issues 
in an effective manner. The second reason for investigating the plug-in model is 
to attempt to provide a virtual machine environment that can dynamically ad- 
apt to meet an application’s needs, rather than forcing the application to fit into 
a fixed environment. Long-lived simulations evolve through several phases: data 
input, problem setup, calculation, and analysis or visualization of results. In tra- 
ditional, statically configured metacomputers, resources needed during one phase 
are often underutilized in other phases. By allowing applications to dynamically 
reconfigure the system, the overall utilization of the computing infrastructure 
can be enhanced. 

The overall goals of the Harness project are to investigate and develop three 
key capabilities within the framework of a heterogeneous computing environ- 
ment: 

— Techniques and methods for creating an environment where multiple distri- 
buted virtual machines can collaborate, merge or split. This will extend the 
current network and cluster computing model to include multiple distributed 
virtual machines with multiple users, thereby enabling standalone as well as 
collaborative metacomputing. 

— Specification and design of plug-in interfaces to allow dynamic extensions to 
a distributed virtual machine. This aspect involves the development of a ge- 
neralized plug-in paradigm for distributed virtual machines that allows users 
or applications to dynamically customize, adapt, and extend the distributed 
computing environment’s features to match their needs. 

— Methodologies for distinct parallel applications to discover each other, dyna- 
mically attach, collaborate, and cleanly detach. We envision that this capa- 
bility will be enabled by the creation of a framework that will integrate disco- 
very services with an API that defines attachment and detachment protocols 
between heterogeneous, distributed applications. 

In the preliminary stage of the Harness project, we have focused upon the 
dynamic configuration and reconfiguration of virtual machines, including na- 
ming and addressing schemes, as well as plugin location, loading, validation, 
and synchronization methods. Our design choices, as well as the analysis and 
justifications thereof, and preliminary experiences, are reported in this paper. 

2 Architectural Overview of Harness 

The architecture of the Harness system is designed to maximize expandability 
and openness. In order to accommodate these requirements, the system design 
focuses on two major aspects: the management of the status of a Virtual Machine 
that is composed of a dynamically changeable set of hosts; the capability of 
expanding the set of services delivered to users by means of plugging into the 
system new, possibly user defined, modules on-demand without compromising 
the consistency of the programming environment. 




Dynamic Reconfiguration and Virtual Machine Management 







Fig. 1. A Harness virtual machine 



2.1 Virtual Machine Startup and Harness System Requirements 

The Harness system allows the definition and establishment of one or more 
Virtual Machines (VMs). A Harness VM (see Fig. 1) is a distributed system 
composed of a VM status server and a set of kernels running on hosts and 
delivering services to users. 

The current prototype of the Harness system implements both the kernel 
and the VM status server as pure Java programs. We have used the multithreading 
capability of the Java Virtual Machine to exploit the intrinsic parallelism 
of the different tasks the programs have to perform, and we have built the system 
as a package of several Java classes. Thus, in order to use the 
Harness system, a host must be capable of running Java programs (i.e. must 
be JVM equipped). The different components of the Harness system communicate 
through reliable unicast channels and unreliable multicast channels. In the 
current prototype these communication facilities are implemented using the 
java.net package. 

In order to use the Harness system, applications should link to the Harness 
core library. The basic Harness distribution will include core library versions for 
C, C++ and Java programs, but in the following description we show only Java 
prototypes. 

This library provides access to the only hardcoded service access point of the 
Harness system, namely the core function 

Object H_command(String VMSymbolicName, String[] theCommand). 

The first argument to this function is a string specifying the symbolic name of 
the virtual machine the application wants to interact with. The second argument 
is the actual command and its parameters. The command might be one of the 
User Kernel Interface commands as defined later in the paper or the registerUser 
command. The return value of the core function depends on the command issued. 

In the following we will use the term user to mean a user that runs one or 
more Harness applications on a host, and we will use the term application to 
mean a program willing to request and use services provided by the Harness 
system. 

Any application must register via registerUser before issuing any command 
to a Harness VM. Parameters to this command are userName and userPassword; 
this call will set a security context object that will be used by the system to check 
user privileges. When the registration procedure is completed the application can 
start issuing commands to the Harness system interacting with a local Harness 
kernel. 
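
The calling sequence described above (register first, then issue commands through the single core entry point) might look as follows from an application's point of view. This is a sketch under assumptions: HarnessCoreStub is a hypothetical stand-in that only mimics the ordering rule, the VM name, user credentials and plug-in URL are invented for illustration, and the reply format is a placeholder. Only the H_command signature and the registerUser command with its userName and userPassword parameters come from the text.

```java
// Hypothetical stand-in for the Harness core library. It is written as an
// instance class for the sake of the example; a real library would contact
// a local Harness kernel instead of echoing the command.
class HarnessCoreStub {
    private boolean registered = false;

    Object H_command(String vmSymbolicName, String[] theCommand) {
        if (theCommand == null || theCommand.length == 0) {
            throw new IllegalArgumentException("empty command");
        }
        if ("registerUser".equals(theCommand[0])) {
            // Parameters named in the text: userName and userPassword.
            if (theCommand.length != 3) {
                throw new IllegalArgumentException(
                    "registerUser needs userName and userPassword");
            }
            registered = true; // a real library would build a security context here
            return Boolean.TRUE;
        }
        if (!registered) {
            throw new IllegalStateException("registerUser must be issued first");
        }
        // Placeholder reply: echo the command back to the caller.
        return vmSymbolicName + " <- " + String.join(" ", theCommand);
    }

    public static void main(String[] args) {
        HarnessCoreStub core = new HarnessCoreStub();
        String vm = "demoVM"; // assumed symbolic name
        core.H_command(vm, new String[] {"registerUser", "alice", "secret"});
        Object reply = core.H_command(vm, new String[] {
            "Load", "http://example.org/plugins/solver.jar", "kernelA", "all-or-none"});
        System.out.println(reply);
    }
}
```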



Fig. 2. Event sequence for a distributed plug-in loading 



A Harness kernel is the interface between any application running on a host 
and the Harness system. Each host willing to participate in a Harness VM runs 
one kernel for each VM. The kernel is bootstrapped by the core library during the 
user registration procedure. A Harness kernel delivers services to user programs 
and cooperates with other kernels and the VM status server to manage the VM. 
The status server acts as a repository of a centralized copy of the VM status 
and as a dispatcher of the events that the kernel entities want to publish to the 
system (see Fig. 2). Each VM has only one status server entity in the sense that 
all the other entities (kernels) see it as a single monolithic entity with a single 
access point. Harness VMs use a built-in communication subsystem to distribute 
system events to the participating active entities. Applications based on message 
passing may use this substrate or may provide their own communications fabric 
in the form of a Harness plug-in. In the prototype, native communications use 
TCP and UDP/IP-multicast. 

2.2 Virtual Machine Management: Dynamic Evolution of a Harness VM 

In our early prototype of Harness, the scheme we have developed for maintai- 
ning the status of a Harness VM is described below. The status of each VM is 
composed of the following information: 

— Membership: the set of participating kernels; 

— Services: the set of services that, based on the plug-in modules currently 
loaded, the VM is able to perform both as a whole and on a per-kernel basis; 

— Baseline: the services that new kernels need to be able to deliver to join the 
VM and the semantics of these services. 

It is important to notice that the VM status is kept completely separated from 
the internal status of any user application in order to prevent its consistency 
protocol from constraining users’ applications requirements. 

To prevent the status server from being a single point of failure, each VM 
in the Harness system keeps two copies of its status: one is centralized in the 
status server and the second collectively maintained among the kernels. This 
mechanism allows reconstruction of the status of each crashed kernel from the 
central copy and, in case of status server crash, reconstructing the central copy 
from the distributed status information held among the kernels. 

Each Harness VM is identified by a VM symbolic name. Each VM symbolic 
name is mapped onto a multicast address by a hashing function. A kernel trying 
to join a VM multicasts a “join” message on the multicast address obtained 
by applying the hashing function to the VM symbolic name. The VM server 
responds by connecting to the inquiring kernel via a reliable unicast channel, 
checking the kernel baseline and sending back either an acceptance message or 
a rejection message. All further exchanges take place on the reliable unicast 
channel. To leave a VM a kernel sends a “leave” message to the VM server. The 
VM server publishes the event to all the remaining kernels and updates the VM 
status. Every service that each kernel supports is published by the VM status 
server to every other kernel in the VM. This mechanism allows each kernel in a 
Harness VM to define the set of services it is interested in and to keep a selective 
up-to-date picture of the status of the whole VM. Periodic “I’m alive” messages 
are used to maintain VM status information; when the server detects a crash, 
it publishes the event to every other kernel. If and when the kernel rejoins, the 
VM server gives it the old copy of the status and waits for a new, potentially 
different, status structure from the rejoined kernel. The new status is checked 
for compatibility with current VM requirements. A similar procedure is used to 
detect failure of the VM server and to regenerate a replacement server. 
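
The mapping from a VM symbolic name to a multicast address can be illustrated with a small sketch. The paper does not specify the hashing function, so the one below (String.hashCode folded into the administratively scoped IPv4 block 239.0.0.0/8) is purely an assumption, as are the class and method names.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class VmAddressSketch {
    // Maps a VM symbolic name onto an IPv4 multicast address.
    // Assumed scheme: the low 24 bits of String.hashCode() fill the last
    // three octets, and the first octet is fixed to 239.
    static InetAddress multicastAddressFor(String vmSymbolicName) {
        int h = vmSymbolicName.hashCode();
        byte[] addr = {
            (byte) 239,
            (byte) (h >>> 16),
            (byte) (h >>> 8),
            (byte) h
        };
        try {
            return InetAddress.getByAddress(addr);
        } catch (UnknownHostException e) {
            // Cannot happen for a 4-byte address; rethrow unchecked.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        InetAddress group = multicastAddressFor("physicsVM");
        System.out.println(group.getHostAddress()
            + " multicast=" + group.isMulticastAddress());
    }
}
```

Because every kernel applies the same function to the same name, a joining kernel can compute the group address locally and needs no directory lookup before multicasting its "join" message. Two different names may collide on one address under any such scheme, so the receiver would still have to check the name carried in the message.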

2.3 Services: The User Interface of Harness Kernels 

The fundamental service delivered by a Harness kernel is the capability to ma- 
nipulate the set of services the system is able to perform. The user interface of 
Harness kernels accepts commands with the following general syntax: 

<command> <locator> <targets> 

<Quality of Service> [additional parameters] 

The command field can contain one of the following values: 

— Load to install a plug-in into the system; 

— Run to run a thread to execute plug-in code; 

— Unload to remove an unused plug-in from the system; 

— Stop to terminate the execution of a thread 

Services delivered by plug-ins may be shared according to permission attri- 
butes set on a per plug-in basis. Users may remove only services not in the core 
category. A core service is one that is mandatory for a kernel to interact with 
the rest of the VM. With the stop and unload commands a user can reclaim 
resources from a service that is no longer needed, but the nature of core services 
prevents any user from downgrading a kernel to an inoperable state. However, 
although it is not possible to change core services at run time, they do not repre- 
sent points of obsolescence in the Harness system. In fact they are implemented 
as hidden plug-in modules that are loaded into the kernel at bootstrap time and 
thus easily upgraded. The core services of the Harness system form the baseline 
and must be provided by each kernel that wishes to join a VM. They are: 

— The VM server crash recovery procedure; 

— The plug-in loader/linker module; 

— The core communication subsystem. 

Commands must contain the unique locator of the plug-in to be manipulated. 
The lowest level Harness locator, the one actually accepted by the kernel, is a 
Uniform Resource Locator (URL). However any user may load at registration 
time a plug-in module that enhances the resource management capabilities of 
the kernel by allowing users to adopt Uniform Resource Names (URNs), instead 
of URLs, as locators. The version of this plugin provided with the basic Harness 
distribution allows: 

— Checking for the availability of the plug-in module on multiple local and 
remote repositories (e.g. a user may simply wish to load the “SparseMatrix- 
Solver” plug-in without specifying the implementation code or its location); 

— The resolution of any architecture requirement for impure-Java plug-ins. 

However, the level of abstraction at which service negotiation and URN to URL 
translation will take place, and the actual protocol implementing this proce- 
dure, can be enhanced/changed by providing a new resource manager plug-in to 
kernels. 

The target field of a command defines the set of kernels that are required to 
execute the command. Every non-local command is executed using a two-phase 
commit protocol. Each command can be issued with one of the following Quality 
of Service (QoS) levels: all-or-none and best-effort. A command submitted with an 
all-or-none QoS succeeds if and only if all of the kernels specified in the target field 
are able (and willing) to execute it. A command submitted with a best-effort 
QoS fails if and only if all the kernels specified in the target field are unable 
(or unwilling) to execute it. Both the failure and the success return values include 
the list of kernels able (willing) to execute the command and the list of the unable 
(unwilling) ones. 
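
The two QoS rules can be stated compactly in code. The sketch below only evaluates the outcome once every target kernel has answered; it does not implement the two-phase commit itself. The class QosSketch, the Outcome type and the representation of answers as a map from kernel name to a boolean are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class QosSketch {
    enum Qos { ALL_OR_NONE, BEST_EFFORT }

    // Return value: overall success plus the able and unable kernel lists.
    static final class Outcome {
        final boolean success;
        final List<String> able;
        final List<String> unable;

        Outcome(boolean success, List<String> able, List<String> unable) {
            this.success = success;
            this.able = able;
            this.unable = unable;
        }
    }

    // answers maps each target kernel to true (able and willing) or false.
    static Outcome evaluate(Qos qos, Map<String, Boolean> answers) {
        List<String> able = new ArrayList<>();
        List<String> unable = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : answers.entrySet()) {
            (e.getValue() ? able : unable).add(e.getKey());
        }
        // all-or-none: succeeds iff no kernel refused.
        // best-effort: fails iff every kernel refused.
        boolean success = (qos == Qos.ALL_OR_NONE) ? unable.isEmpty() : !able.isEmpty();
        return new Outcome(success, able, unable);
    }

    public static void main(String[] args) {
        Map<String, Boolean> answers = new LinkedHashMap<>();
        answers.put("kernelA", true);
        answers.put("kernelB", false);
        for (Qos qos : Qos.values()) {
            Outcome o = evaluate(qos, answers);
            System.out.println(qos + ": success=" + o.success
                + " able=" + o.able + " unable=" + o.unable);
        }
    }
}
```

Both lists are returned in either case, mirroring the statement that failure and success return values alike report which kernels could and could not execute the command.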



2.4 Related Work 



Metacomputing frameworks have been popular for nearly a decade, since the 
advent of high-end workstations and ubiquitous networking in the late 1980s enabled 
high-performance concurrent computing in networked environments. PVM 
0 was one of the earliest systems to formulate the metacomputing concept in 
concrete terms, and explore heterogeneous network computing. PVM, however, 
is inflexible in many respects. For example, merging and splitting of multiple 
distributed virtual machines (DVMs) is not supported. Two different users cannot interact, cooperate, and share 
resources and programs within a live PVM machine. PVM uses internet proto- 
cols which may preclude the use of specialized network hardware. A “plug-in” 
paradigm would alleviate all these drawbacks while providing greatly expanded 
scope and substantial protection against both rigidity and obsolescence. 

Legion 0 is a metacomputing system that began as an extension of the 
Mentat project. Legion can accommodate a heterogeneous mix of geographi- 
cally distributed high-performance machines and workstations. Legion is an ob- 
ject oriented system where the focus is on providing transparent access to an 
enterprise-wide distributed computing framework. 

The model of the Millennium system 0 being developed by Microsoft Rese- 
arch is similar to that of Legion’s global virtual machine. Logically there is only 
one global Millennium system composed of distributed objects. However, at any 
given instant it may be partitioned into many pieces. Partitions may be caused 
by disconnected or weakly connected operations. This could be considered 
similar to the Harness concept of dynamic joining and splitting of DVMs. 

Globus |5| is a metacomputing infrastructure which is built upon the Ne- 
xus |S| communication framework. The Globus system is designed around the 
concept of a toolkit that consists of the pre-defined modules pertaining to com- 
munication, resource allocation, data, etc. Globus even aspires to eventually 
incorporate Legion as an optional module. This modularity of Globus remains 
at the metacomputing system level in the sense that modules affect the global 
composition of the metacomputing substrate. 

The above projects envision a much wider-scale view of distributed resources 
and programming paradigms than Harness. Harness is not being proposed as a 
world-wide infrastructure, but more in the spirit of PVM, it is a small heteroge- 
neous distributed computing environment that groups of collaborating scientists 
can use to get their science done. Harness is also seen as a research tool for 
exploring pluggability and dynamic adaptability within DVMs. 

3 Conclusions and Future Work 

In this paper we have described our early work on the plug-in mechanism and 
the dynamic Virtual Machine (VM) management mechanism of the Harness 
system, an experimental metacomputing system. These mechanisms allow the 
Harness system to achieve reconfigurability not only in terms of the computers 
and networks that comprise the VM, but also in the capabilities and the ser- 
vices provided by the VM itself, without compromising the coherency of the 
programming environment. 

Early experience with small example programs shows that the system is able 
to: 



— Adapt to changing user needs by adding new services via the plug-in mecha- 
nism; 

— Safely add or remove services to a distributed VM; 

— Locate, validate and load locally or remotely stored plug-in modules; 

— Cope with network and host failure with a limited overhead; 

— Dynamically add and remove hosts to the VM via the dynamic VM mana- 
gement mechanism. 

In a future stage of the Harness project we will test these features on 
real-world applications. 

JEM-DOOS: The Java*/RMI Based Distributed 
Objects Operating System of the JEM Project** 

Abstract. The Java technology 0 provides support to design and develop 
platforms to deal with heterogeneous networks. One of the goals 
of the JEM project, Experimentation environMent for Java, carried out 
at LaBRI is to design and develop such a platform. The JEM project [2] 
consists in providing a distributed platform that makes using heterogeneous 
networks of computers easier, and in using this platform as a laboratory 
for experimentation purposes. It is based on Java, RMI and CORBA. 
In this paper, we present an overview of the design and the implementation 
of the kernel of our platform. This kernel is called JEM-DOOS 
for JEM-Distributed Objects Operating System. Its inspiration owes a lot 
to POSIX, especially to POSIX.1. We adapt the way this standard deals 
with file systems to deal with object systems, i.e. hierarchies of objects 
similar to POSIX hierarchies of files. In the current release, alpha 0.1, 
the objects we have implemented provide access to system resources, such as 
processors, screens, etc. Furthermore, JEM-DOOS supports remote access 
to objects, which makes it distributed. Hence, JEM-DOOS provides 
a way to deal with heterogeneous objects in heterogeneous networks of 
computers. 



1 Introduction : The JEM Project 

The JEM project [13] carried out at LaBRI has two main aims. The first is to facilitate 
the use and programming of distributed heterogeneous systems by means 
of the design and implementation of a distributed platform. Java technology 
makes it possible for us to handle heterogeneity, and CORBA and RMI 
technologies make it possible to deal with distribution, using remote method invocation 
and object transfer through the network. The current implementation 
is based on RMI. Future releases will also provide a CORBA interface. 

The second aim is to use the above platform as a laboratory to study, with 
other research teams, some difficult problems related to parallel and distributed 
technologies [2]: algorithmic debugging; threads and their use, their debugging, 
their modeling and the problem of distributed priorities; mobile agents, their 
modeling and the validation of applications using them. 

* Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, 
Inc. in the United States and other countries. The author is independent 
of Sun Microsystems, Inc. 

** This work is partly supported by the Universite Bordeaux I, the Region Aquitaine 
and the CNRS. 

This paper describes the basis of our platform, JEM-DOOS, Experimentation 
environMent for Java - Distributed Objects Operating System. Its design is 
similar to that of POSIX. POSIX defines a standard operating system interface 
and environment based on the UNIX operating system documentation to 
support application portability at the source level. The aim of DOOS, our 
Distributed Objects Operating System, is to provide _object¹ handling in a way 
similar to file (and directory) handling provided by POSIX. Note that we do 
not claim conformance to POSIX; we just applied a POSIX-like methodology to 
the design of our system. An _object of our system can be any object complying 
with a given interface (that we will define later in this paper). For instance, 
in the current implementation we have developed _objects that give access to 
system resources through this standard interface. These _objects are Machine, 
ProcessorManager, Processor, Screen, etc. As a result, listing the contents of 
/net/alpha.labri.u-bordeaux.fr/processors/ under DOOS gives a list of 
the processors of alpha.labri.u-bordeaux.fr. 

The rest of this paper is organized as follows. We first give an overview 
of related work in Sect. 2. Section 3 presents the basic concepts of POSIX.1 
related to files. Section 4 shows the design of DOOS compared to POSIX and 
Sect. 0 gives information about the current implementation compared to the UNIX 
implementation of POSIX. Section 0 explains how Java/RMI technology makes 
DOOS a distributed system. We finally provide an example application using 
DOOS in Sect. Q and conclude with future directions of the project in Sect. 0. 

2 Related Work 

Many research projects are being developed around Java and distributed/parallel 
technologies. In this section we present some of the most significant of these 
projects. 

Communication libraries. Communication libraries have been integrated into 
Java. jPVM uses the possibility to interface native code with Java so as to 
make the PVM library available from Java. Another library, JPVM, is a PVM-like 
implementation written entirely in Java. This implementation has the advantage 
of not requiring PVM on the target machines, and consequently provides 
a greater level of portability of applications. Other projects, such as E (The E 
Extensions to Java), offer higher-level extensions to basic Java communication 
mechanisms. 

Extensions for parallel and distributed programming. Some systems propose 
to add new constructs for parallel programming, like Java// for instance, or 
to add frameworks for data-parallel programming like Do!. Other projects, 
like JavaParty, offer both an execution support and new constructs to ease 
the design and implementation of distributed Java applications. 

¹ We use the notation _object to prevent any confusion between _objects handled by 
DOOS and objects in the usual meaning of object oriented technology. 

Platforms. MetaJava is a more general system. It relies on a meta level 
that makes it possible to manage a set of objects. It is based on a transfer 
of control to this meta level by event generation at the object level. Hence, 
the meta level can coordinate, for instance in a distributed environment, the 
activities of the basic objects. Legion is also a high-level system. It provides 
a platform that integrates all the support necessary to manage a set of distributed 
software and hardware resources. 

Our platform mainly differs from Legion in that it proposes a common in- 
terface to all objects of the system. See HSl for further comparison to other 
systems. 



3 POSIX 

To handle a file from within an application, POSIX defines four entities: a file, a 
file descriptor, an open file description and an open file. These definitions lead 
to an abstract model of a possible POSIX implementation as shown in Fig. 1. 



[Diagram labels: file descriptor, open file description, open file, directory, other files] 

Fig. 1. A model of POSIX implementation and parenthood handling 



A file is an object that can be written to, or read from, or both. An open file is a 
file that is currently associated with a file descriptor. An open file description is 
a record of how a process or group of processes are accessing a file. An open file 
description contains information on how a file is being accessed, plus a handle to 
an open file. A file descriptor is a per-process unique, non-negative integer used 
to identify an open file for the purpose of file access. A file descriptor references 
a structure that contains information plus a handle to an open file description. 
Most of the time the name file descriptor is used for both this integer and the 
structure it refers to. The integer is required for inheritance 
between processes: since processes do not share the same address space, 
using a handle, i.e. an address, would prevent them from inheriting or sharing file 
descriptors. 

To ensure some sort of organization of files, POSIX provides the notion of 
directory (see Fig. 1). A directory is a file, the data of which are references to 
other files. 

4 DOOS 

We explained in Sect. 1 that the architecture of DOOS is closely based on the 
POSIX file system architecture, but applied to objects: POSIX I/Os deal with 
files, i.e. “dead” data stored on devices, whereas DOOS I/Os deal with “living” data, i.e. 
objects stored in memory. Hence, a DOOS _object² is an object. Because of this 
similarity of design, DOOS relies on a set of basic definitions that can be directly 
derived from those of POSIX. An open _object is an _object that is 
currently associated with an _object descriptor³. The _objects composing an _object 
system can be anything. The sole constraint is that they show (by means of a 
presentation layer as shown in Fig. 2) the same open _object interface, which 
makes it possible to handle them through a set of basic generic operations. For 
instance, in the current implementation, we have developed _objects that provide 
access to actual system resources such as files, screens, machines, processors, 
etc. Of course, it is also possible to define _objects without relationship to system 
resources. An open _object description is a record of how a process or group of 
processes are accessing an _object. An open _object description contains information 
on how an _object is being accessed, plus a handle to an open _object. An 
_object descriptor is a per-process unique, non-negative integer used to identify 
an open _object for the purpose of _object access. An _object descriptor references 
a structure that contains information plus a handle to an open _object 
description. Following POSIX, both the integer and the structure it refers to are 
called _object descriptor. 



[Figure 2 labels: _object descriptor; open _object description (attributes, methods); nextChild(); other _object; open _object; _object]

Fig. 2. A model of DOOS implementation and parenthood handling



DOOS provides a notion of parenthood between _objects. In the same manner
as a directory can contain references to other files, an _object has a method to
get handles to other _objects (see Fig. 2). This feature provides a hierarchy of
_objects in DOOS.

Note on notation: Remember that we use the notation _object to prevent any
confusion between _objects handled by DOOS and objects in the usual meaning of
object-oriented technology. The distinction is not important for this precise
definition, but will make sense in the rest of this paper.

Note on open _objects: When there is no possible confusion, we will use the term
_object to talk about an open _object.

The interface of a DOOS _object defines the following three features.

Naming: String getName(): In the current release names are directly associated
with _objects, not with parent _objects. In this way, names can be built by
_objects so as to describe the features they provide, e.g. the resource they
interface (see Sect. 2). Names are mainly used for user interface purposes, for
instance when listing the children of a given _object (see Sect. 7).

Input: void copyIn(OpenObject _object) (POSIX write): Any _object can
be copied into any _object. No global semantics is given to this operation. It is
up to each _object of the _object system to decide how it deals with an _object
copied into it. For instance, in the current implementation, an _object copied
to a Screen _object is simply printed out, and when a File _object is copied into
another File _object, its contents replace the contents of that other File _object.

Access to Children: The set of children of a given _object can vary during
execution. Therefore, three functions are provided to access them: startChildren
(POSIX rewinddir), hasMoreChildren, and nextChild (POSIX readdir). We
cannot use a standard Java Enumeration since it would prevent the _object from
building its children dynamically while they are enumerated.

This leads to the following Java interface for an open _object:

public interface OpenObject {
    public String getName() throws labri.jem.exception;
    [...]
    public void startChildren() throws labri.jem.exception;
    public boolean hasMoreChildren() throws labri.jem.exception;
    public OpenObject nextChild() throws labri.jem.exception;
    public void copyIn(OpenObject _object) throws labri.jem.exception;
    [...]
}

Most DOOS _objects are lightweight since the above operations are usually
straightforward to implement.
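
To make the remark about lightweight _objects concrete, here is a minimal sketch of an _object that builds its children while they are being enumerated. It is only an illustration under stated assumptions: SimpleOpenObject is a simplified, hypothetical stand-in for OpenObject (it drops labri.jem.exception and the elided members), and RangeObject is an invented class that is not part of DOOS.

```java
// Simplified stand-in for the OpenObject interface (assumption: no checked
// exceptions, and only the members discussed in the text).
interface SimpleOpenObject {
    String getName();
    void startChildren();            // counterpart of POSIX rewinddir
    boolean hasMoreChildren();
    SimpleOpenObject nextChild();    // counterpart of POSIX readdir
    void copyIn(SimpleOpenObject other);
}

// Hypothetical _object whose children are created one at a time, only when
// the enumeration reaches them.
class RangeObject implements SimpleOpenObject {
    private final String name;
    private int childCount;          // may change during execution
    private int cursor = 0;

    RangeObject(String name, int childCount) {
        this.name = name;
        this.childCount = childCount;
    }

    public String getName() { return name; }

    public void startChildren() { cursor = 0; }

    public boolean hasMoreChildren() { return cursor < childCount; }

    public SimpleOpenObject nextChild() {
        if (!hasMoreChildren()) {
            throw new java.util.NoSuchElementException(name + " has no more children");
        }
        // The child is built here, during the enumeration itself.
        RangeObject child = new RangeObject(name + "/" + cursor, 0);
        cursor++;
        return child;
    }

    // No global semantics: this particular _object chooses to gain one more
    // child for every _object copied into it.
    public void copyIn(SimpleOpenObject other) { childCount++; }
}
```

Because the enumeration state is a plain cursor rather than a snapshot, a child added by copyIn while a listing is in progress is still reached by the next call to nextChild, which is the behaviour a standard Enumeration would make awkward.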

5 Low Level Implementation in Unix and DOOS 

Under Unix, the fact that data stored in a physical device are interpreted as a
file system, i.e. as files and directories, is the responsibility of both a layer of the
operating system (which we call presentation layer) and a device driver (see Fig.
3). The presentation layer to be used depends on the device and file type. This
layer is in charge of formatting physical data to provide the POSIX abstraction.
Within this framework, a readdir operation could be implemented as follows
(see Figs. 1 and 3):

1. [user] call readdir(file descriptor fd)

2. [system] access open file description to check access mode

3. [system] access open file to get physical data location

4. [system] get device driver and feed it with a read bytes request

5. [driver] apply operation

6. [system] format device driver answer depending on device and on file type

7. [system] return result to user

8. [user] use the result

[Figure 3 labels: device, inode, Peripheral, Presentation layer, Driver]

Fig. 3. A possible POSIX/UNIX implementation

In DOOS we do not need drivers as required in a UNIX implementation
such as the one shown in Fig. 3. The work done by a driver under UNIX
mainly consists in transferring bytes between a physical device and the system.
Under DOOS this is directly achieved by the Java Virtual Machine (Fig. 4). For
instance, accessing the name of an object o is done by accessing o.name, which
is effectively done by the JVM. Within this framework, a nextChild operation
(POSIX readdir) is implemented as follows (see Figs. 2 and 4):

1. [user] call nextChild(_object descriptor od)

2. [system] access open _object description to check access

3. [system] access open _object to get physical _object handle

4. [system] invoke nextChild() method on presentation layer associated to
_object

5. [system] return result to user

6. [user] use the result

[Figure 4 labels: _Object, Presentation layer, JVM]

Fig. 4. DOOS implementation

Each _object type is provided with a specific presentation layer (see Sect. 4).
This layer is in charge of showing the OpenObject interface to DOOS.
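
The nextChild steps listed above can be sketched as a small descriptor table. This is an illustration only, not the DOOS source: Node, OpenDescription, DescriptorTable and the mayList access flag are hypothetical names, and the presentation layer is reduced to three methods on Node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical open _object: only child enumeration is modelled.
class Node {
    final String name;
    private final List<Node> children;
    private Iterator<Node> cursor;

    Node(String name, Node... children) {
        this.name = name;
        this.children = Arrays.asList(children);
        this.cursor = this.children.iterator();
    }

    void startChildren() { cursor = children.iterator(); }
    boolean hasMoreChildren() { return cursor.hasNext(); }
    Node nextChild() { return cursor.next(); }
}

// Open _object description: how the _object is accessed, plus a handle to it.
class OpenDescription {
    final Node target;
    final boolean mayList;   // hypothetical access flag

    OpenDescription(Node target, boolean mayList) {
        this.target = target;
        this.mayList = mayList;
    }
}

// Per-process table: an _object descriptor is an index into this list.
class DescriptorTable {
    private final List<OpenDescription> slots = new ArrayList<OpenDescription>();

    int open(Node target, boolean mayList) {
        target.startChildren();
        slots.add(new OpenDescription(target, mayList));
        return slots.size() - 1;
    }

    // Step 1: the user calls nextChild with a descriptor.
    Node nextChild(int od) {
        OpenDescription description = slots.get(od);      // step 2: check access
        if (!description.mayList) {
            throw new IllegalStateException("listing not permitted on descriptor " + od);
        }
        Node open = description.target;                   // step 3: get the handle
        // Steps 4 and 5: invoke the presentation layer and return the result;
        // null signals that no child is left.
        return open.hasMoreChildren() ? open.nextChild() : null;
    }
}
```

The integer returned by open plays the role of the _object descriptor: it stays meaningful when handed to code that cannot see the table's internal references, which is the reason the text gives for using integers rather than handles.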

6 The Distributed Architecture of DOOS 

DOOS provides a way to aggregate a set of _object systems into a single
distributed _object system. To do so, it mainly relies on the possibility of having
handles to remote objects. It is then enough to use remote handles instead of
local handles in the structure presented in Fig. 2. These remote handles are
implemented using the RMI framework provided by Java. RMI also provides
distributed garbage collection.

The basic feature required to make DOOS effectively distributed is the
possibility to mount an _object system onto another _object system. This is
illustrated by the following code. Although it will not be detailed here, it mainly
consists in copying the _object to be mounted into a remote _object, the mount point:

public static void main(String args[]) {
    // The _object to be mounted.
    OpenEntry myroot =
        new OpenEntry(null, args[0], new PMachine(null, args[0]));
    // The remote _object used as mount point.
    OpenEntry mountPoint = DOOS.open("/net/alpha.labri.u-bordeaux.fr");
    // Mounting is a copy of myroot into the mount point.
    mountPoint.copyIn(myroot);
}

7 Example 

The following example shows the implementation under DOOS, working at the
open _object level (see Fig. 2), of a command equivalent to the Unix ls command:

// The OpenObject methods declare labri.jem.exception, so it is propagated here.
public void list(String path) throws labri.jem.exception {
    OpenObject oo = DOOS.open(path);
    oo.startChildren();                      // rewind the enumeration
    while (oo.hasMoreChildren()) {
        OpenObject child = oo.nextChild();
        System.out.println(child.getName());
        DOOS.close(child);
    }
    DOOS.close(oo);
}



The major functional difference with the Unix ls command is that the DOOS
command can be applied to any _object and it works both with local and remote
_objects.

8 Conclusion and Future Work

We have presented version alpha 0.1 of the JEM-DOOS Distributed Objects
Operating System. We emphasized the fact that it is built on the same concepts
as the UNIX file handling system, as defined by POSIX.1. We have introduced
remote references to _objects based on the Java/RMI mechanism. We illustrated
the use of the current implementation of the system.

JEM-DOOS is still evolving and more testing is required. Future work
concerns the implementation of features for which we already provide the
appropriate structures without yet using them, for instance access rights.
In the longer term, we intend to provide a POSIX.2-like interface to final users
of JEM-DOOS.

Abstract. A network of objects is a set of objects interconnected by
pointers or the equivalent. In traditional languages, objects are allocated
individually, and networks of objects are assembled incrementally.
We present a set of language constructs that can create static networks:
networks of objects which are created atomically, and which are immutable.
Then, we present some of the most interesting abilities of static
networks.



1 Introduction 

In most concurrent object-oriented languages, objects are allocated individu- 
ally, and networks of objects are assembled incrementally. We have developed 
a new way to allocate networks of objects. When using our constructs, entire 
grids, trees, or graphs pop into existence instantaneously. The structures crea- 
ted this way are immutable. Such networks are termed static networks. We have 
discovered that static networks lead to more elegant code and better software en- 
gineering. For example, we have discovered that static networks make it possible 
to define a new kind of lexical scoping mechanism. We have also discovered that 
static networks can dramatically reduce the amount of code needed to interface 
objects to each other. 

2 The Constructs 

Our static network construct could be added to any existing concurrent
object-oriented language. We have been using a variant of Java called “Distributed
Java” for our experiments. It is a very conventional concurrent object-oriented
language; we chose it for its simplicity and familiarity. It contains the usual
constructs of Java. To this, we added several conventional concurrency constructs.

Our first addition to Java is the operator new classname(constructorargs) on 
processor, that allocates an object on a specified processor. It returns a proxy of 
the object: a tiny object that masquerades as the real object. When one invokes a 
method on the proxy, the arguments are transparently shipped to the real object, 
and the return value transparently shipped back. If the object is remote, the
method receives copies of the parameters which are indistinguishable from the
originals. If the parameters are of numeric types, this is easy. If the parameters
are of object types, the callee receives a proxy. It is often desirable, when passing
data across processor boundaries, to copy the data instead of creating a proxy.
Therefore, we add a parameter-list keyword copy that allows the system to copy
the parameter instead of providing a proxy.

Java already has threads. We added a shorthand form of the thread creation 
mechanism: object<-method(arguments). The new thread invokes a method on an
object and then terminates. Unlike some concurrent object-oriented languages, 
we allow multiple threads to execute within the same object at the same time. 
However, any contiguous sequence of statements is atomic unless it contains a 
blocking statement. There are two blocking statements: method invocation, and 
the wait statement. The wait statement causes the current thread to suspend. 
Any modification to the instance variables of the current object awakens the 
thread. In other words, it means wait for an instance variable to be modified. 
There is a second form of the wait statement: wait condition. It checks the 
condition, and if false, it suspends. When reawakened by the modification of an 
instance variable, it rechecks the condition, and if still false, it resuspends. 

The operation newgroup classname(constructorarguments)[size] creates a
one-dimensional array of objects that spans multiple processors. It returns the
handle for the group. One can invoke a method on any element of the group by
saying grouphandle[index].method(arguments). One can invoke a method on all
elements of the group by saying grouphandle[ALL]<-method(arguments). One
can invoke a method on an arbitrarily selected member of the group by simply
saying grouphandle.method(arguments). The latter behavior emulates the
behavior of Concurrent Aggregates [1], HAL groups [2], and Charm++ branch offices
[3]. It enables a group to act as a bottleneck-proof object. Any member of the
group can determine its position in the group by evaluating the pseudovariable
thisindex. Any member of the group can obtain the group handle by evaluating
the pseudovariable thisgroup.

2.1 Static Object Hierarchies 

Now that we have a simple foundation language, we add the static network 
support. The core of our static network support is the agent declaration. The 
agent declaration is most easily explained by comparing it to some traditional 
code with a similar effect: 

class Ct {
    D T[10];
    Ct() {
        for (int i=0; i<10; i++)
            T[i] = new D(1000);
    }
};

class C {
    agent T(int i) is D(1000);
};



Class Ct is traditional code. Each time an object of class Ct is allocated, 10
objects of class D automatically pop into existence (because of the constructor).
Those objects are named T[0], T[1], etc. When you allocate an object of class
Ct, you’re effectively allocating a small tree of objects, with the object of class
Ct at the root, and the 10 objects of class D underneath. Class C uses the agent
declaration to achieve an almost-identical effect. Each time an object of class C
is allocated, a conceptually infinite set of objects of class D pops into existence.
Those objects are named T(0), T(1), etc. They are allocated lazily the first
time they are accessed. So once again, allocating an object of class C effectively
allocates a small tree of objects.

More generally, the syntax of the agent declaration is agent agentname(indices)
is classname(constructorargs). It must always occur inside a class declaration.
Assuming it occurs inside a class C, it declares that each object of class C has
several “agents” (other objects working for it). The object of class C can access its
agents by evaluating the expression agentname(indices), and the system ensures
that they are there when accessed. The agents are of class classname and are
initialized with the specified constructorargs. The constructorargs can refer to
the agent indices and to the constructor arguments of class C. There can be
as many indices as desired; they can be of any type which could reasonably
function as a hash table key.
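
In plain Java, the lazy, index-keyed allocation described above can be approximated with a hash table that creates an entry on first access. The sketch below is an emulation under stated assumptions, not the Distributed Java implementation: AgentFamily, allocatedAgents and the single-process setting are hypothetical, and a single int index stands in for the general index list.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Emulates "agent T(int i) is D(1000)": a conceptually infinite family of
// objects indexed by i, each one created the first time it is accessed.
class AgentFamily<T> {
    private final Map<Integer, T> table = new HashMap<Integer, T>();
    private final IntFunction<T> constructor;

    AgentFamily(IntFunction<T> constructor) {
        this.constructor = constructor;
    }

    synchronized T get(int index) {
        T agent = table.get(index);
        if (agent == null) {
            agent = constructor.apply(index);   // lazy allocation
            table.put(index, agent);
        }
        return agent;
    }

    synchronized int allocated() { return table.size(); }
}

class D {
    final int size;
    D(int size) { this.size = size; }
}

class C {
    // No setter is exposed, so the hierarchy cannot be rewired after creation.
    private final AgentFamily<D> t = new AgentFamily<D>(i -> new D(1000));

    D T(int i) { return t.get(i); }

    int allocatedAgents() { return t.allocated(); }
}
```

From the caller's point of view every T(i) already exists, yet only the agents that were actually touched occupy memory, which is how an unbounded index space remains affordable.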

The real difference between using a constructor, as in Ct, and an agent
declaration, as in C, is that C’s hierarchy is immutable. It also presents the illusion
of having been created atomically.

2.2 Communication Across the Hierarchy 

The agent construct can only create trees of objects. However, if we allow
sibling-to-sibling communication, then the hierarchies can also emulate other structures.
Consider, for example, a class jacobi containing an agent declaration agent
jnode(int i, int j) is jacobinode(). The agents form a shallow hierarchy with a single
jacobi object in charge of innumerable jacobinode objects. If jnode(i,j) invokes
methods on its siblings jnode(i+1,j), jnode(i-1,j), jnode(i,j+1), and jnode(i,j-1),
then the lines of communication form a grid. So this agent hierarchy is in one
sense a tree, but in another equally meaningful sense it is a grid. Because we want
to be able to represent arbitrary structures like grids, trees, graphs, and other
communication patterns, it is important that the objects in an agent hierarchy
be able to send messages to their siblings.

If you implement hierarchies of agents using parent pointers and child
pointers, then sibling-to-sibling communication cannot be done. The siblings don’t
have pointers to each other. Because of this, we use a completely different
implementation procedure for agents. We assign each object an ID string according
to two simple rules. There are two disjoint kinds of objects: those allocated with
new, and those which are agents. The ID of an object allocated with new is of the
form classname#seqno@processor, where the seqno is a unique ID to tell objects
apart, and processor is the location of the object. The ID of an object which is an
agent is of the form ownerid.agentname(indices), where ownerid is the ID string
of the agent’s owner, and agentname(indices) is the agent’s name and indices as
they appear in the code. An object in the agent hierarchy can easily use string
manipulation to compute the ID of a sibling, parent, or child. IDs are mapped
statically to processors, and every agent is stored in a hash table (by ID) on its
home processor.
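
The two ID rules and the string manipulation they enable can be sketched in a few lines of Java. This is only an illustration: AgentIds and its method names are hypothetical, the sketch assumes integer indices and class names without dots, and the hash-modulo mapping to processors is an assumption, since the text only states that the mapping is static.

```java
class AgentIds {
    // Rule 1: an object allocated with new has ID classname#seqno@processor.
    static String rootId(String className, int seqno, int processor) {
        return className + "#" + seqno + "@" + processor;
    }

    // Rule 2: an agent has ID ownerid.agentname(indices).
    static String agentId(String ownerId, String agentName, int... indices) {
        StringBuilder sb = new StringBuilder(ownerId);
        sb.append('.').append(agentName).append('(');
        for (int k = 0; k < indices.length; k++) {
            if (k > 0) sb.append(',');
            sb.append(indices[k]);
        }
        return sb.append(')').toString();
    }

    // Owner of an agent: drop the last ".agentname(indices)" component.
    // Assumes integer indices and dot-free names, so the last '.' is the separator.
    static String ownerOf(String agentId) {
        int dot = agentId.lastIndexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("not an agent ID: " + agentId);
        }
        return agentId.substring(0, dot);
    }

    // Static mapping of an ID to its home processor (assumption: hash modulo).
    static int homeProcessor(String id, int processors) {
        return Math.floorMod(id.hashCode(), processors);
    }
}
```

A jnode(3,4) agent that wants to reach its neighbour jnode(4,4) would take ownerOf its own ID and rebuild the sibling's ID with agentId; no pointer to the sibling is ever stored.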

When remotely invoking a method on an agent, one must perform four steps 
that are not necessary for other objects. On the sender side, one must compute 
the ID of the agent, one must compute its hash value, and one must determine its 
home processor by mapping its ID to a processor number. At the receiver side, 
one must look the object up in the hash table. We picked a typical statement 
that involves all these steps from one of our demonstration programs and hand- 
optimized it. We executed the code on a Cyrix PR266 processor, and discovered 
that the sender-side steps took 0.5 microseconds, and the receiver-side steps took 
0.2 microseconds. It is our judgment that these values are within the acceptable 
range. Note that our prototype implementation does not yet generate code of 
this quality. 

The user of Distributed Java never sees agent IDs; they are hidden inside
proxy objects. One utilizes a variety of notations to obtain proxy objects. The
simplest is when an object X evaluates the expression agentname(indices). The
system implicitly concatenates “.agentname(indices)” to the ID of X, hides the
resulting ID inside a proxy object, and returns the proxy. Another notation is
when an object X evaluates the keyword owner. The system implicitly truncates
the ID of X, yielding the ID of X’s parent in the hierarchy. Again, it hides the
ID in a proxy and returns the proxy.

Getting the proxy of a sibling is implementationally easy, but it would break
scoping rules. Consider class C in Section 2.1. Class C contains the declaration
of the T agents. So it is reasonable to refer to those agents by name (e.g. T(0),
T(1), etc.) inside the methods of C. But it is not reasonable to refer to the name
T in the methods of D, since the declaration of T isn’t anywhere near D. There
are two solutions. The first is to move the declaration of D inside class C, as
shown below. This puts the code of D into a scope where the T agents are visible.
That way, the methods of class D can straightforwardly refer to their siblings by
name.

class C {
    agent T(int i) { /* insert body of class D here */ }
};



The second solution is used when you can’t move D into C, for example, 
when D is a library class. In that case, the solution is to program D so that it 
sends its output to its owner. The owner can then forward the data to wherever 
it needs to go. The data goes from D to C to D. To avoid a bottleneck at 
the C object, we provide relay methods. One places the keyword relay in front 
of a method declaration. The compiler prohibits relay methods from accessing 
instance variables. But since the relay method doesn’t actually access the object, 
it can be performed on any processor. Invocation of a relay method does not cause
a remote method invocation; the method is performed locally. To use this in our
sample problem, the object of class D invokes a relay method in class C, which
does nothing except invoke a method on another D. At the implementation level,
only one remote method invocation happens, going from D to D.



2.3 The From Clause 

We now add one more feature. We allow an object to ask: “who invoked this 
method?” At a glance, this may not appear to be related to static networks. 
However, if one were to ask this question in a traditional language, the only 
answer one could expect would be something like “you were invoked by object 
#5907.” That answer doesn’t do you any good. Static networks make it possible 
to get an answer of the form “this method was invoked by agent T(5)”, which 
is a meaningful and useful answer. So effectively, static networks are what make 
the from clause possible. Its notation is shown in the sample code below: 

class C {
    agent T(int i) { ... owner<-data(f(g(i))) ... }
    public void data(int n) from T(int j) {
        printf("T(%d) sends the result %d\n",j,n);
    }
};

In this sample, each agent T(k) computes a value and sends it back to its
owner, the object of class C, by invoking the data method on it. If an agent T(k)
invokes data, then the from clause will cause data to execute with j bound to
k. If any other object tries to invoke data, this definition of the method will not
be visible. However, method overloading is permitted as usual, so there may be
other definitions of data, one of which may be visible.
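
The effect of the from clause can be imitated at run time in ordinary Java by passing the sender's ID explicitly and matching it against the agent name. The sketch below is a rough emulation under stated assumptions, not the mechanism used by Distributed Java: FromClause is a hypothetical class, the sender ID follows the ownerid.agentname(indices) format of Section 2.2, and returning null stands in for "this definition is not visible".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FromClause {
    // Matches a sender whose last ID component is T(<integer>).
    private static final Pattern FROM_T = Pattern.compile(".*\\.T\\((-?\\d+)\\)$");

    // Emulates: public void data(int n) from T(int j) { ... }
    // Returns the message the method would print, or null when the sender is
    // not a T agent and this definition therefore does not apply.
    static String data(String senderId, int n) {
        Matcher m = FROM_T.matcher(senderId);
        if (!m.matches()) {
            return null;
        }
        int j = Integer.parseInt(m.group(1));   // j is bound to the sender's index
        return "T(" + j + ") sends the result " + n;
    }
}
```

The point of the exercise is that the binding of j is only possible because agent IDs carry a meaningful name and index; with an opaque object number there would be nothing to match against.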



3 Shared Variable Scoping 

One surprising ability of static networks is their ability to create lexical scoping 
rules for shared data. Some concurrent object-oriented languages do provide glo- 
bal variables, which are shared. However, if an algorithm stores data in a global 
variable, you typically can’t run multiple copies of that algorithm at the same 
time. In other words, the use of global variables tends to make your code non- 
reentrant. This limits the utility of global variables in a parallel program, where 
the whole idea is to run many copies of many algorithms concurrently. Because 
of these reentrancy problems, many concurrent object-oriented languages do not 
provide shared variables at all. But shared variables are quite useful: it’s their 
global scope which is the problem. 

Agents can provide shared variables which aren’t global. Consider the code
below. The first listing shows the prototype of a writeonce class which holds a
single integer. It provides methods set and get which can set and retrieve the
integer. The get method is supposed to block until the value has been set. The
second listing shows class K. The object of class K will own a writeonce object
named V and a number of objects named T(0), T(1), etc.

class writeonce {
    void set(int n);
    int get();
};

class K {
    agent V is writeonce();
    agent T(int i) { /* these objects may refer to V */ };
};



We can define the word variable to mean a storage location named by an
identifier, where the identifier has a scope. By that definition, the agent V is
a variable. It is accessible to several objects T(0), T(1), etc., so it is a shared
variable. Its scope is not global. If we create two instances of class K, there will
be two copies of this hierarchy, and each will include its own copy of V. The two
hierarchies will function independently of each other. In short, static networks
are giving us true static scoping. This in turn makes it possible to use shared
data without sacrificing reentrancy.
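
A plain-Java approximation of this scoping argument is sketched below. It is an illustration under stated assumptions rather than the paper's code: WriteOnce uses Java monitors to play the role of the wait condition statement, K owns one WriteOnce per instance in place of the agent V, and the compute method with its single writer thread is a hypothetical stand-in for the T agents.

```java
class WriteOnce {
    private boolean isSet = false;
    private int value;

    synchronized void set(int n) {
        if (isSet) {
            throw new IllegalStateException("already set");
        }
        value = n;
        isSet = true;
        notifyAll();                      // wake every thread blocked in get()
    }

    // Blocks until the value has been set.
    synchronized int get() {
        while (!isSet) {                  // re-check the condition after each wake-up
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return value;
    }
}

class K {
    // Shared by the workers of this K only: shared, but not global.
    final WriteOnce v = new WriteOnce();

    // Hypothetical driver: one worker publishes n through v, the caller reads it.
    int compute(final int n) {
        Thread writer = new Thread(new Runnable() {
            public void run() { v.set(n); }
        });
        writer.start();
        int seen = v.get();
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }
}
```

Two K objects can run compute at the same time without interfering, because each one reaches its own v; replacing v with a static field would reintroduce exactly the reentrancy problem described in this section.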



4 Improved Compositionality 

Concurrent object-oriented languages make it extremely difficult to interface 
concurrent objects to each other. To demonstrate the problem, we will attempt 
a programming task that should be easy, but turns out not to be. 

The task is to start with the two library classes shown below, and compose
them in such a way that they perform the computation A * B * C, where
A, B, and C are square matrices. The class matmul below is a simple matrix
multiplier: one feeds in the rows of A and columns of B, and it sends out the
elements of the result matrix. The class conv2rows is a converter that rearranges
the elements of a matrix: one feeds in columns or individual elements, and it
produces rows. Both classes use continuation-passing style (CPS). In other words,
they accept the handle of a group sendto, and they produce their output by sending
it directly to the specified group. We chose continuation-passing style because it
is the only efficient interface for these classes. Using call-return communication
would double the bandwidth requirements or introduce unnecessary bottlenecks. CPS
is common in concurrent object-oriented programs.



class matmul {
    matmul(int rows, int cols, object sendto);
    void row_a(int r, vector v);
    void col_b(int c, vector v);
};

class conv2rows {
    conv2rows(int rows, int cols, object sendto);
    void col(int c, vector v);
    void elt(int r, int c, double v);
};
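
To show what a continuation-passing interface of this kind looks like in ordinary Java, here is a minimal sequential sketch of a matmul-like class. It is an illustration under stated assumptions, not the library class used in the paper: ElementSink is a hypothetical callback interface standing in for the sendto group handle, vectors are plain double arrays, and everything runs in a single thread.

```java
import java.util.HashMap;
import java.util.Map;

// Receiver of result elements; stands in for the sendto handle.
interface ElementSink {
    void result(int row, int col, double value);
}

class MatMul {
    private final int rows;
    private final int cols;
    private final ElementSink sendto;
    private final Map<Integer, double[]> rowsOfA = new HashMap<Integer, double[]>();
    private final Map<Integer, double[]> colsOfB = new HashMap<Integer, double[]>();

    MatMul(int rows, int cols, ElementSink sendto) {
        this.rows = rows;
        this.cols = cols;
        this.sendto = sendto;
    }

    // Feed in row r of A; emit every element that can now be computed.
    void row_a(int r, double[] v) {
        if (r < 0 || r >= rows) throw new IndexOutOfBoundsException("row " + r);
        rowsOfA.put(r, v);
        for (Map.Entry<Integer, double[]> e : colsOfB.entrySet()) {
            emit(r, v, e.getKey(), e.getValue());
        }
    }

    // Feed in column c of B; emit every element that can now be computed.
    void col_b(int c, double[] v) {
        if (c < 0 || c >= cols) throw new IndexOutOfBoundsException("column " + c);
        colsOfB.put(c, v);
        for (Map.Entry<Integer, double[]> e : rowsOfA.entrySet()) {
            emit(e.getKey(), e.getValue(), c, v);
        }
    }

    // The output is never returned to the caller: it is pushed to sendto.
    private void emit(int r, double[] row, int c, double[] col) {
        double sum = 0.0;
        for (int k = 0; k < row.length; k++) {
            sum += row[k] * col[k];
        }
        sendto.result(r, c, sum);
    }
}
```

The class decides the name of the method it calls on its continuation (result here), so a receiver that expects a different name, such as elt, needs some adapter in between; that mismatch is the composition problem the rest of this section works through.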



The design of the code is as follows. We will create a group mulabc to be
the coordinator. For consistency, it will expect all the inputs A, B, and C to be
fed in one column-vector at a time. The mulabc group will create two matmul
groups, one to multiply (A * B), and one to multiply (A * B) * C. It will create
a group conv2rows to convert the columns of A into the rows of A, and another
conv2rows to convert the elements of (A * B) into the rows of (A * B).

Though the first matmul group has a sendto parameter, we cannot simply
configure it to send its output to the first conv2rows. The problem is that the
method names don’t match: matmul wants to invoke the result method, but
conv2rows wants its user to invoke elt. The solution is to configure the
matmul such that it sends its output back to the mulabc. The mulabc then
forwards the data back to the conv2rows, using the right method name. In fact,
we must use the same strategy everywhere: all groups will send their intermediate
results back to the mulabc for dispatching.

CPS now creates a problem: the mulabc cannot tell the output of the first
matrix multiplier from the output of the second multiplier. To distinguish them,
we have to send the two outputs to two different places. So we must create a
second group mulabc1, to give the second multiplier a place to which to send its
output. This completes the design; two versions of the code are shown below:



class mulabc {
    matmul mm1, mm2; conv2rows cr;

    mulabc(int size, object sendto) {
        if (thisindex == 0) {
            intermed = newgroup mulabc1();
            mm1 = newgroup matmul(size, size, intermed);
            mm2 = newgroup matmul(size, size, thisgroup);
            cr = newgroup conv2rows(size, size, intermed);
            er = newgroup conv2rows(size, size, thisgroup);
            // broadcast object handles to all members of group.
            intermed[ALL]<-distribute_handles(mm1, er);
            thisgroup[ALL]<-distribute_handles(mm1, mm2, cr);
        }
    }

    public void distribute_handles(matmul mm1_a,
                                   matmul mm2_a, conv2rows cr_a) {
        mm1 = mm1_a;
        mm2 = mm2_a;
        cr = cr_a;
    }

    void input_A(int col, vector v) {
        wait (cr != NIL);
        cr<-col(col, v);
    }

    void input_B(int col, vector v) {
        wait (mm1 != NIL);
        mm1<-col_b(col, v);
    }

    void input_C(int col, vector v) {
        wait (mm2 != NIL);
        mm2<-col_b(col, v);
    }

    void row(int row, vector v) {
        wait (mm2 != NIL);
        mm2<-row_a(row, v);
    }

    void result(int row, int col, double d) {
        wait (sendto != NIL);
        sendto<-result(row, col, d);
    }
};

class mulabc1 {
    matmul mm1;
    conv2rows er;

    void distribute_handles(matmul mm1_a, conv2rows er_a) {
        mm1 = mm1_a;
        er = er_a;
    }

    void row(int row, vector v) {
        wait (mm1 != NIL);
        mm1<-row_a(row, v);
    }

    void result(int row, int col, double d) {
        wait (er != NIL);
        er<-elt(row, col, d);
    }
};



class mulabc {
    mulabc(int size) { };
    agent mm1 is matmul(size, size);
    agent mm2 is matmul(size, size);
    agent cr is conv2rows(size, size);
    agent er is conv2rows(size, size);

    relay void input_A(int col, vector v) {
        cr<-col(col, v);
    }

    relay void row(int row, vector v) from cr {
        mm1<-row_a(row, v);
    }

    relay void input_B(int col, vector v) {
        mm1<-col_b(col, v);
    }

    relay void result(int row, int col, double d) from mm1 {
        er<-elt(row, col, d);
    }

    relay void row(int row, vector v) from er {
        mm2<-row_a(row, v);
    }

    relay void input_C(int col, vector v) {
        mm2<-col_b(col, v);
    }

    relay void result(int row, int col, double d) from mm2 {
        owner<-result(row, col, d);
    }
};



The version using agents (the second listing above) is about half the size of the
traditional version. The complexity in the traditional version comes from four
sources. First, mulabc has to be a group to avoid bottlenecks. Agents eliminate
the bottleneck by means of relay methods. Second, everything must be allocated,
connected together, and pointers passed about: this takes a significant amount of
code. Third, the traditional version contains much explicit synchronization to deal
with the fact that the network is not created atomically. Agents create the entire
structure instantaneously, eliminating the need for explicit synchronization. Fourth,
CPS forces the traditional version to introduce mulabc1; the from clause eliminates
the problems created by CPS, and with them the need for mulabc1. The combined
effect is dramatic.

5 Summary and Conclusions 

We have identified a new way to allocate networks of objects: atomically. We de- 
veloped a language-level construct, the agent declaration, that allows us to create 
hierarchies of objects atomically. Such hierarchies can emulate other structures 
like grids and graphs. We described two major advantages. One, they enable us 
to support static scoping rules, enabling the reentrant use of shared data. Two, 
they significantly simplify the interfacing of modules, improving the composi- 
tionality of the language. These are only two of the uses: a paper of this size 
is not sufficient to describe all the possibilities. We have discovered that agent 
ID strings make a powerful tool when used for program trace analysis. We have 
discovered that the agent construct subsumes the group construct, allowing us to 
eliminate it. This means that the language is not becoming more complex with 
the addition of agents. We have discovered that the zero-cost creation of agent 
structures makes it possible to set up a large structure and only use it once.
This leads to a programming style that is more like functional programming. We 
do not expect to stop discovering new uses for static networks. 

Abstract. We describe SsQueue (SnapshotQueue), an efficient and
user-friendly class library for FIFO queues that can be used in the state
vectors of simulated queuing networks under the Time Warp parallel
discrete event simulation (PDES) protocol. There exists a general-purpose
Time Warp simulation kernel, warped, where users have only to define a
state vector and do not have to care about rollback and state recovery.
However, since the state vector must be defined as an inlined data
structure, it is not suitable for dynamic data structures such as FIFO
queues. This class can serve as an element of such a state vector, so that
both libraries and users can handle each instance as a snapshot of the
queue. Taking advantage of the FCFS nature of this data structure,
operation histories rather than all contained items can be safely stored
and restored using this class library with virtually minimal overhead.
When the kernel deletes instances in the simulated past, the corresponding
methods perform garbage collection transparently.

1 Introduction 

Discrete event simulation (DES) technology has been playing an important role
in the design and evaluation of computer and communication systems. However,
the rapid growth of these systems in both scale and speed has made DES relatively
slower. Large-scale, high-speed Asynchronous Transfer Mode (ATM) networks
and their components such as multi-stage switches are among these systems. We
can benefit from parallel processing provided that simulation models of such
systems have ample parallelism. In this case, simulation models are divided into
sub-models called Logical Processes (LPs). Each LP maintains its own simulation
clock and internal state, propagating any causal effects to other sub-models
using external event messages. Many parallel discrete event simulation (PDES)
protocols to synchronize LPs have been proposed, which are roughly divided into
two categories. In conservative protocols [1], each LP strictly adheres to causality
constraints. In optimistic protocols [2], each LP proceeds speculatively. When a
causality error occurs, it rolls back, restores its state, and resumes simulation.

In optimistic protocols, state saving is necessary to roll back and 
coast forward along simulated time. However, state saving and rollback are 
too complex and error-prone for application programmers to implement by 
themselves. 

Although several simulation libraries and languages support these 
mechanisms automatically, they all adopt fixed-size inlined data structures 
(e.g., integers, floating-point numbers) as the elements of their state 
vectors. They are not suitable for dynamic data structures such as queues in 
queuing network simulations. 

There are, of course, ways to implement a queuing data structure using 
inlined data structures. For instance, we can use LPs to simulate each 
container of a queue (Fig. 1(a)). Such a method may be most suitable for 
hardware simulation practitioners who are interested in the detailed 
behavior of the queues themselves. However, for designers who are more 
interested in higher-level simulations where queues are viewed merely as FIFO 
buffers, this method is both too complicated and inefficient. Furthermore, 
since there is no “Global State” in PDES, information about which LP holds 
the next candidate for “enqueue” and “dequeue” must be exchanged using 
additional external event messages. 



[Figure 1 labels: queue container LPs; server LP; head; tail; panels (a) and (b)] 

Fig. 1. Alternative implementation strategies for FIFO queue; (a) associating each 
container with an independent LP and (b) using an array of containers as a ring buffer 



Another way to implement a queuing data structure as an element of a state 
vector using inlined data structures is to use an array of containers 
(pointers) whose length is equal to the maximum number of items to be 
inserted (Fig. 1(b)). Although this method is easy to implement, the memory 
consumption, and consequently the cost of each state saving, is always 
proportional to the array length. 
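
The ring-buffer strategy can be illustrated with a minimal sketch. The type 
name RingState, the capacity of 1024, and the helper saveSnapshot are 
assumptions made for illustration only; they do not come from warped or from 
any existing library. The point is that a kernel which saves state by plain 
copying always copies the whole array, however few items are queued. 

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity ring buffer stored inline in a state vector.
// The name RingState and the capacity of 1024 are illustrative assumptions.
struct RingState {
    static constexpr std::size_t kCapacity = 1024;  // must be fixed a priori
    void*       slots[kCapacity] = {};  // containers (pointers to items)
    std::size_t head = 0;               // index of the oldest item
    std::size_t length = 0;             // number of items currently stored

    bool enqueue(void* item) {
        if (length == kCapacity) return false;  // queue full: item is rejected
        slots[(head + length) % kCapacity] = item;
        ++length;
        return true;
    }

    void* dequeue() {
        if (length == 0) return nullptr;
        void* item = slots[head];
        head = (head + 1) % kCapacity;
        --length;
        return item;
    }
};

// A kernel that saves state by plain copying moves sizeof(RingState) bytes
// per snapshot, regardless of how many items are actually queued.
inline RingState saveSnapshot(const RingState& current) { return current; }
```

With two items queued, a snapshot still copies all 1024 slots, which is the 
overhead the text attributes to this strategy. 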

In addition, both of the above techniques must fix the maximum number of 
items that can be inserted a priori. This constraint is burdensome when one 
wants to determine by simulation how long the queues in switches must be to 
achieve a required upper bound on the loss rate. In such a simulation, the 
maximum instantaneous queue length may be of interest. 

This problem can be solved using the incremental state saving (ISS) 
facilities in SPEEDES. With ISS, not all elements have to be stored in order 
to save the state of such a “redundant” data structure. 

The SPEEDES kernel also maintains rollback and coast-forward automatically, 
but it performs user-provided DO/UNDO operations on the state queue rather 
than copying the corresponding snapshot in and out of the state queue as 
warped [2] does. 

Any operation that can be undone is successfully maintained using ISS. 
However, when a FIFO queue is implemented using ISS, its FCFS nature is not 
fully exploited. Consequently, the cost of rollback is proportional to the 
rollback distance, since successive operations must be undone one by one. 
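
The DO/UNDO scheme described above can be sketched as follows. This is an 
illustration only: the class UndoLogQueue and its members are hypothetical 
names and are not taken from SPEEDES. Each operation appends a record to a 
log, and rollback undoes the records in reverse order, so the number of undo 
steps equals the rollback distance. 

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical DO/UNDO-style FIFO queue; all names here are illustrative and
// are not taken from SPEEDES.
class UndoLogQueue {
public:
    void enqueue(void* item) {
        items_.push_back(item);
        log_.push_back({Op::Enqueue, item});
    }

    void* dequeue() {
        if (items_.empty()) return nullptr;
        void* item = items_.front();
        items_.pop_front();
        log_.push_back({Op::Dequeue, item});
        return item;
    }

    // A mark is simply the number of operations logged so far.
    std::size_t mark() const { return log_.size(); }

    // Undo operations one by one until the log is back at `mark`.
    // Returns the number of undo steps, i.e. the rollback distance.
    std::size_t rollbackTo(std::size_t mark) {
        std::size_t steps = 0;
        while (log_.size() > mark) {
            const Record r = log_.back();
            log_.pop_back();
            if (r.op == Op::Enqueue) items_.pop_back();  // undo an enqueue
            else items_.push_front(r.item);              // undo a dequeue
            ++steps;
        }
        return steps;
    }

    std::size_t length() const { return items_.size(); }

private:
    enum class Op { Enqueue, Dequeue };
    struct Record { Op op; void* item; };
    std::deque<void*>   items_;  // current queue contents
    std::vector<Record> log_;    // one record per enqueue/dequeue
};
```

Note that one log record is saved per operation, so in this sketch the 
frequency of state saving is tied to the frequency of queue operations. 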

In our approach, the cost of rollback is independent of the rollback distance 
(except for the cost of invalidating cancelled state history), because the kernel 
simply lets the application programmer access the appropriate snapshot instead 
of the current state. 

The combination of our approach and warped offers another benefit. The 
kernel may want to take snapshots of the state infrequently as an 
optimization. Our approach is independent of the frequency of state saving, 
as multiple enqueue/dequeue operations can be represented without taking a 
snapshot each time. In ISS, on the other hand, each enqueue/dequeue operation 
requires DO/UNDO data structures to be saved. That is to say, the frequency 
of state saving is equal to the frequency of enqueue/dequeue operations. 

The rest of this paper is organized as follows. Section 2 gives an overview 
of our class library. Section 3 describes design and implementation as well as 
API of our class library in detail. Section 4 provides an example of Time Warp 
queuing network simulation using our class library. Final comments and conclu- 
ding remarks are given in Sect. 5. 

2 Overview of the Class Library 

In a FIFO queue, unlike a priority queue, elements are never rearranged, so 
it is easy to implement “difference”-based state saving. Elements common to 
successive snapshots (elements enqueued before the previous snapshot and not 
dequeued as of the current snapshot) are shared via pointers. These 
mechanisms are encapsulated in the C++ class library implementation and are 
transparent both to application programmers and to the simulation kernel. 
Henceforth we call this package SsQueue (Snapshot Queue), since each of its 
instances represents a snapshot at the moment it is created. SsQueue achieves 
virtually minimal state saving and recovery overhead (each instance consists 
of five pointers and an integer), which is independent of the number of items 
inserted. 

3 Implementation 

This section explains the API from a programmer’s point of view, the interface 
from the simulation kernel’s point of view, and finally the internal representation. 

3.1 API on User’s Side 

The interface on the user’s side is quite simple. 

— void enqueue(void *item) inserts an item. 

— void *dequeue() removes and returns the first item. 

— void *top() is a non-destructive version of dequeue(). 

— int getLength() obtains the current length. 

3.2 Interface on Simulation Kernel’s Side 

The Time Warp simulation kernel performs periodic state saving by copying 
the current state variables to the state queue. When rollback occurs, the 
snapshot just before the rollback point in the state queue is copied back to 
the current state variables. SsQueue traps both operations by providing a 
custom copy constructor and assignment operator. In addition, when the kernel 
deletes an SsQueue instance (the memory occupied by committed states in the 
state queue must be returned to the system), its destructor performs its own 
garbage collection. No modification of the simulation kernel is necessary to 
use SsQueue. 

3.3 Internal Representation 

Figure 2 shows how SsQueue implements the temporal relationship and the 
sharing of elements among neighboring snapshots. Applying enqueue and dequeue 
operations to the current state (S(t + 1) in the figure) is straightforward. 
When an item is enqueued, the corresponding container is added at the tail of 
the container list, and the tail pointer is modified to point to that 
container. When an element is dequeued, the item pointed to by head is 
returned to the user, and head moves on to the next container (in the figure, 
the container that points to “g”). Note that dequeued items and their 
containers are not actually “deleted” immediately. They are deleted when the 
kernel deletes the corresponding snapshot, as described below. 

[Figure 2 labels: SsQueue instances; list of containers; items] 



Fig. 2. Temporal relationship and sharing elements among neighboring snapshots in 
SsQueue instances 



A temporal relationship is established when the kernel performs state 
saving and restoration. The overridden copy constructor and assignment 
operator trap the copy operations in both cases. When a state is saved, the 
assignment operator is called with the current state as the source and a 
snapshot in the state queue as the target. In this case, SsQueue links the 
new snapshot just before the current state (i.e., next of the new snapshot 
points to the current copy of SsQueue, and prev of the current copy points to 
the snapshot). On rollback, the kernel copies the snapshot just before the 
rollback point back to the current state variables. SsQueue detects this case 
because the assignment operator is now called with a snapshot as its copy 
source. SsQueue then invalidates the cancelled snapshots and the 
corresponding items (they may be recycled before actually being deleted) and 
re-links the restored state as the direct predecessor of the new current 
state. 

Garbage collection is invoked from the overridden destructor when the 
kernel deletes committed states. Take S(t − 1) in Fig. 2 as an example. 
Containers (and items) from inithead to just before head must be deleted. 
Items “a” and “b” are deleted in this case. 

These constructs are transparent to both the simulation kernel and 
application programmers, letting the former remain unchanged and the latter 
avoid handling rollback and state recovery. The cost of state saving and 
restoration is reasonably small, as each snapshot requires only five pointers 
(prev, next, inithead, head, tail) and an integer (length). When a state is 
recovered, the selected state is always ready for subsequent enqueuing and 
dequeuing, yielding (other than invalidating descendants) constant and thus 
minimal overhead. 
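
The idea of sharing containers among snapshots can be illustrated with a 
simplified sketch. It is not the SsQueue implementation: as a stated 
assumption, std::shared_ptr reference counting stands in for the prev, next 
and inithead bookkeeping and for the destructor-driven garbage collection 
described above, and items are not owned by the queue. The class name 
SnapshotQueueSketch is hypothetical. What the sketch preserves is that a 
snapshot is a constant-size copy (two pointers and an integer here) and that 
rollback is a plain assignment. 

```cpp
#include <cassert>
#include <memory>

// Simplified snapshot queue. Assumption: std::shared_ptr replaces the paper's
// prev/next/inithead bookkeeping, and items are not owned by the queue.
class SnapshotQueueSketch {
public:
    void enqueue(void* item) {
        auto node = std::make_shared<Node>();
        node->item = item;
        if (length_ == 0) {
            head_ = node;        // start a fresh chain
        } else {
            tail_->next = node;  // also unlinks any cancelled successors
        }
        tail_ = node;
        ++length_;
    }

    void* dequeue() {
        if (length_ == 0) return nullptr;
        void* item = head_->item;
        head_ = head_->next;     // older snapshots keep the container alive
        --length_;
        return item;
    }

    void* top() const { return length_ == 0 ? nullptr : head_->item; }
    int getLength() const { return length_; }

private:
    struct Node {
        void* item = nullptr;
        std::shared_ptr<Node> next;
    };
    std::shared_ptr<Node> head_;  // first container visible to this snapshot
    std::shared_ptr<Node> tail_;  // last container visible to this snapshot
    int length_ = 0;              // number of visible containers
};
```

Copying an instance saves the state, and assigning a saved copy back to the 
current instance rolls it back; containers enqueued after the snapshot are 
ignored because the restored length bounds what is visible. Snapshots that 
were cancelled by a rollback must not be used afterwards, which matches the 
invalidation step in the text. 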

4 Example Using SsQueue 

This section illustrates a Banyan switch simulation as an example of 
large-scale parallel queuing network simulations. A Banyan switch is a 
multi-stage self-routing switching network that is used in high-speed 
broadband network switches. Figure 3 shows how an 8 port, 3 stage Banyan 
switch is configured. Each 2 x 2 switching element (drawn as a rounded 
rectangle) reads the header of an incoming ATM cell and selects an output 
port. The cell is buffered if it attempts to go to a port in use. All of 
these operations together form an 8 x 8 switch, performing point-to-point 
routing of ATM cells from input ports to output ports. 

The simulation program is built on top of warped, a public domain Time 
Warp simulation kernel written in C++. In warped, users define their own LPs 
by inheriting from the basic LP class¹ and defining a custom event processing 
method, which is called by the kernel. The basic LP class is provided as a 
template class that takes a state vector class as its argument. 

We defined four LPs to simulate a Banyan switch: CellSrcObj, which 
generates input traffic; CellRouterObj, a 2 x 2 router; CellOutBufObj, an 
output buffer and port; and CellSinkObj, which collects cell statistics on 
each output port. Figure 4 shows how the SsQueue class is integrated into the 
CellOutBufObj user program. 

¹ In warped, the term LP refers to a per-processor entity that aggregates 
multiple simulation objects. By convention, we call these simulation objects 
LPs here. 

Fig. 3. LPs constructing 3 stage Banyan switch 

 1: /*************** State variables definition ***************/
 2: class CellOutBufState : public BasicState {
 3: public: SsQueue ssq;   // SsQueue instance as a state variable
 4:         int isBusy;    // 1 if server is busy
 5: };
 6:
 7: /********************* LP definition **********************/
 8: class CellOutBufObj : public SimulationObj<CellOutBufState> {
 9: public: void executeProcess();   // user-defined event handler
10:   ...
11: };
12:
13: /*************** User defined event handler ***************/
14: void CellOutBufObj::executeProcess() {
15:   CellEvent *event = (CellEvent *)getEvent();   // next evt to exec
16:   switch (event->type) {
17:   case ARRIVECELL:   // Arrival of ATM cell
18:     SsQueue *ssq = &(state.current->ssq);
19:     if (!(state.current->isBusy) && ssq->isEmpty()) {
20:       // cell is forwarded immediately without queuing
21:     } else {   // buffer this cell
22:       Cellinfo *cell = new Cellinfo;
23:       *cell = ((ArriveCellEvt *)ace)->cell;
24:       ssq->enqueue(cell);   // cell is buffered into SsQueue
25:     }
26:     break;
27:   }
28: }

Fig. 4. Coding example using SsQueue 

The state vector declaration is given in lines 2 to 5, including SsQueue as 
part of the state variables (line 3). The LP is defined using this state 
vector, as in lines 8 to 11. Lines 17 to 26 give an idea of how a cell 
arriving at CellOutBufObj is handled. If the output port is busy, the cell is 
put into an SsQueue instance, as in lines 19 and 24. 

Performance evaluation of the whole simulation program is given as the 
elapsed real time needed to simulate 50000 µseconds (Fig. 5). The experiment 
was carried out on 1 to 32 processors of an SR2201, a distributed memory 
multicomputer. Relatively good speedup is achieved, especially when the 
number of switching elements (SEs) increases. 

Fig. 5. Elapsed real time in seconds to simulate 50000 µseconds 



5 Conclusion 

In this paper, SsQueue, an efficient and user-friendly class library implementa- 
tion of a FIFO queue for Time Warp simulation was described. Each instance of 
this class occupies a reasonably small memory space, so that state saving cost 
is independent of the number of items in the queue. Sharing containers among 
adjacent snapshots minimizes the cost of garbage collection. 

SsQueue stands between the provider of a general purpose Time Warp simu- 
lation platform and application programmers. The simulation kernel needs only 
support inlined data as state variables of Logical Processes, although it may be 
responsible for all aspects of optimistic synchronization. Using SsQueue, simu- 
lation programmers who do not want to be concerned about rolling-back and 
state saving can still have the queue data structure as state variables of Logical 
Processes in such a simulation kernel. 

The application to Banyan switch simulation was also presented, showing 
both the simplicity of integration and the suitability of SsQueue for 
real-world application programs. 

Abstract. A profiler is an important tool for understanding the dynamic 
behaviour of concurrent programs to locate problems and optimize 
performance. The best way to improve profiling capabilities and reduce 
the time to analyze a concurrent program is to use a target-specific 
profiler that understands the underlying concurrent runtime environment. 
A profiler for understanding execution of user and kernel level threads 
is presented, which is target-specific for the µC++ concurrency system. 
This allows the insertion of hooks into the µC++ data structures and 
runtime kernel to ensure crucial operations are monitored exactly. Because 
the profiler is written in µC++ and has an extendible design, it is easy 
for users to write new metrics and incorporate them into the profiler. 



1 Introduction 

As programs grow more complex, a greater need arises for understanding their 
dynamic behaviours, to locate problems and optimize performance. Concurrency 
increases the complexity of behaviour and introduces additional problems not 
present in sequential programs. An important tool for locating problems and 
performance bottlenecks is a profiler. However, sequential profiling techniques 
cannot be trivially extended into the concurrent domain. A concurrent profiler 
must deal with multiple threads of control, all potentially introducing errors and 
performance problems. Profiling concurrent programs has been done for perfor- 
mance analysis, algorithm analysis, coverage analysis, tuning, and debugging. 

We believe the best way to improve concurrent profiling capabilities and 
reduce the time to analyze a concurrent program is to use a target-specific 
profiler that understands the underlying concurrent runtime environment. Our 
experience in designing several target-specific concurrency tools (high-level 
concurrent extensions for C++, called µC++, a debugger, a profiler, and other 
concurrent toolkits) leads us to conclude that construction of a universal 
profiler for all languages and concurrency paradigms is doomed to failure. 

2 Motivation 

The basis of this work is µC++, a shared-memory, user-level thread library 
running on symmetric multiprocessors (e.g., SUN, DEC, SGI, MP-PC); kernel 
threads associated with shared memory provide parallelism on multiprocessors, 
and user threads refine that parallelism. The µC++ environment provides a 
target-specific debugger for break-point debugging on a user-level thread 
basis, and experience has shown it aids in the development of robust 
concurrent programs. Nevertheless, debugging is normally based on a 
hypothesis concerning the reason for the erroneous behaviour of a program. To 
reason about general runtime behaviour, including performance analysis, 
coverage analysis, and tuning, a profiling tool is needed to monitor 
execution and reveal information at different levels of detail. 

Several profilers for concurrent programs exist, but most are general purpose 
tools with little understanding of the concurrency paradigm. Each concurrent en- 
vironment provides a different paradigm, which a profiling tool must be aware of 
in order to provide effective monitoring, analyzing and visualizing of a program’s 
behaviour. The analysis of a concurrent program’s performance and algorithmic 
behaviour becomes more effective and efficient through target-specific profiling, 
where the profiler has internal knowledge about the runtime system intrinsics 
and the underlying programming paradigm. 

Extendibility is also crucial in designing and implementing an effective pro- 
filer, since it is impossible to predict suitable metrics for all imaginable situations. 
Therefore, a profiling tool should provide a set of general purpose metrics and a 
mechanism enabling a program analyst to quickly develop new problem specific 
metrics. Hence, an analyst must use knowledge about the profiler, which is eas- 
ier when the profiler operates as part of the target system and when the metric 
extensions can be written in a familiar language. Ideally, the same language is 
used for the profiled program and the profiler extensions. 

Finally, a profiler must operate at different levels of detail on concurrent 
programs to provide the functionality for both exact and statistical profiling. To 
profile large-scale concurrent programs, selective profiling must be supported: It 
must be possible to turn profiling on and off dynamically to target specific parts 
of a large program. 



3 Related Work 

Most profiling tools have been developed for analyzing the performance of 
scientific, mostly data-parallel programs, written in a message-based 
programming environment. For this arena, successful and powerful tools with a 
wide range of analysis and visualization modules exist. For example, Pablo is 
a tool with many visual and audio performance data presentation modules. 
Pablo also introduced a standard trace log format which is adopted by other 
profile analysis and visualization tools. Another example is Paradyn, a tool 
for profiling large-scale, long-running applications. These program 
characteristics require some novel instrumentation and analysis methods: 
dynamic instrumentation insertion and removal based on execution-time 
profiling information, or user interaction. The results of dynamic 
instrumentation are promising, but the overhead introduced may reduce 
effectiveness when profiling code-parallel (in contrast to data-parallel) 
programs with shorter execution times. 

Concurrent profiling tools may be available only as part of the operating 
system, which allows monitoring of programs, and information about calls for 
kernel thread creation, synchronization and communication primitives. For 
example, the Mach Kernel Monitor instruments kernel thread context switching. 
This approach assumes that the concurrent program’s runtime system uses only 
operating system features, instead of providing portable, user-level thread 
creation, synchronization and communication primitives. 

Among the first profiling tools for a user-level thread library was Quartz. 
A target-specific profiling environment for concurrent object-oriented 
programs is pC++. pC++ is one of the few cases where the integrated 
performance analysis environment TAU was implemented in concert with the 
language and runtime system. However, the design of pC++/TAU incorporates 
most of its profiling functionality into the preprocessor and runtime system, 
so extending the profiling metrics by the program analyst takes more effort. 
The tight coupling between the language/runtime system and the profiling tool 
makes integration into other existing thread libraries infeasible. 



4 µProfiler 

µProfiler is a concurrent profiler, running on UNIX based symmetric 
shared-memory multiprocessors, that achieves our goal of target-specific, 
extendible, fine-grained profiling on a user-level thread basis. µProfiler 
supports the µC++ shared-memory programming model, which shares all data and 
has multiple kernel and user-level threads. Profiling µC++ programs requires 
incorporating both concurrent and object-oriented aspects, i.e., profiling 
different threads of control at the per-object level. 

Profiling sequential programs is non-trivial but well-understood. Additional 
challenges arise when profiling concurrent programs in a shared-memory 
environment similar to µC++. Since the environment provides user-level tasks, 
the profiler must monitor the program’s activity at that level. The profiler 
also needs internal knowledge about the runtime system to identify and 
monitor each executing task independently and exactly. The µProfiler design 
deals with these challenges and presents a mechanism to effectively integrate 
extendible profiling into the concurrency system. 

µProfiler is a concurrent program written in µC++, executing concurrently 
with the profiled µC++ application (see Fig. 1). A cluster, which groups user 
and kernel threads and restricts execution of those user threads to the 
kernel threads in the cluster, is the µC++ capability which enables 
concurrent application and profiler execution. The user threads in the 
profiler cluster monitor execution of the runtime kernel and other clusters 
using direct memory reads via shared memory. On multiprocessor computers, the 
kernel thread in the profiler cluster executes the profiler user threads in 
parallel with the application. If the amount of monitoring is large, more 
kernel threads can be added to the profiler cluster to increase parallelism. 
Application performance is therefore degraded only by the contention created 
by profiler operations. This cost can be 100 to 1000 times less than 
monitoring from a separate UNIX process, which requires cross-address-space 
reads. More complex monitoring is thus possible, while still having only a 
small effect on the application. 

Fig. 1. Integration of µProfiler into the profiled program 

To access µProfiler, the compilation flags -profile and -kernelprofile cause 
the necessary instrumentation insertion and linking with the profiling 
libraries. The first flag profiles only the user program; the second flag 
profiles the µC++ kernel calls made by the program. (The latter information 
is often inappropriate and confusing to users.) When the program starts, a 
menu appears, from which a user selects several builtin metrics, after which 
the program is run, and the metric output appears. Thus, in the simplest 
case, instrumentation insertion and activation of the profiling modules is 
completely transparent to the programmer. Additionally, parts of a program 
may be compiled with or without the profile flag(s), and then linked 
together, creating an executable where selected parts are instrumented and 
profiled. Finally, even more precise control is available through routines in 
the µC++ runtime system to turn profiling on and off for a particular thread 
at any point during execution. 

Because µProfiler is truly integrated with µC++, it was possible to insert 
hooks into the µC++ runtime kernel to ensure crucial operations are monitored 
exactly, such as user and kernel thread creation/destruction, and migration 
of user and kernel threads among clusters. Purely statistical monitoring and 
dynamic instrumentation could miss some of these events. Also, dynamic 
instrumentation is considered too expensive when profiling programs with 
short or intermediate execution times. Exact routine counts are obtained via 
static instrumentation insertion at compile-time using shared trampolines. 
The C compiler -pg option is used to generate routine-entry instrumentation; 
the routine-entry instrumentation is augmented to generate routine-exit 
instrumentation. In addition, arbitrary hooks can be inserted into user code 
by the program analyst. 

All hooks can be dynamically activated and deactivated on a per-thread 
basis, but only when the profiler is present in the application (i.e., the 
existence of the profiler is checked for dynamically inside the runtime 
kernel). Each activated hook results in the profiled thread sending event(s) 
to the profiler, which passes the information to the active profiling 
monitors. Figure 2(a) shows a (simple) exact metric operating at the user 
thread level. For each user thread, gprof-like routine call information is 
available, including call cycles. Function calls from within each executed 
routine are presented with the corresponding routine call count information. 
Figure 2(b) shows a (simple) exact metric operating at the kernel thread 
level. For each kernel thread, UNIX level information is available, including 
a Kiviat graph to quickly relate different time metrics. 

(a) Exact User Thread Metric (b) Exact Kernel Thread Metric 

Fig. 2. µProfiler exact metrics 

For statistical monitoring, µProfiler can monitor selected threads by 
periodically sampling at a dynamically adjustable frequency to collect 
profiling data with minimum interference. The current implementation of 
µProfiler monitors the profiled program at the per-cluster, per-thread, 
routine level. Figure 3(a) shows a statistical metric operating at the 
cluster level. For this cluster, performance statistics are displayed for 
each task executing on it, broken down by task states (running, ready, 
blocked). “Coverage Time” is the percentage of time the task is sampled. By 
clicking on any of the tasks listed for a cluster, detailed information is 
available for that task. Figure 3(b) shows a statistical metric operating at 
the task level. For this task on the cluster, performance statistics are 
displayed for each routine call executed by the task, broken down by task 
states. 

To ensure a high degree of flexibility and extendibility, µProfiler is 
subdivided internally into parts representing the underlying functionality, 
including a profiling kernel, execution monitors, metric analyzers, and 
visualization devices. Each of these parts is split into submodules, which 
are ordered in a class hierarchy. Building a new metric requires building at 
least two components: an execution monitor component and a metric analysis 
component. An execution monitor is built as follows. Determine the 
functionality of the metric: e.g., exact, statistical or both kinds of 
profiling. Then create a C++ class that inherits from the abstract class 
uExecMonitor, and specialize a subset of uExecMonitor’s virtual routines to 
provide the necessary functionality. Class uExecMonitor provides virtual 
members for different purposes such as routine entry/exit notification, 
periodic polling, etc. Finally, add an initialization call to the routine 
Initialize in the constructor of the new class. Each execution monitor is 
responsible for operating and updating its own objects and possibly 
accumulating, filtering or summarizing the profiling data collected when the 
profiler task calls the registered members. Creating new metric analysis 
components is done in a similar manner by inheriting from a class called 
uMetricAnalyze. Specialized members of uExecMonitor are automatically 
registered with the profiler during the call to Initialize along with the new 
execution monitor. µProfiler maintains a list of all execution monitors and 
their member routines, and invokes them during execution as needed. Since the 
registration of new metrics is done dynamically, the analyst simply links 
them with the application and restarts it, and µProfiler calls into the 
metric class’s member routines when the requested events occur. 
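
The inherit, specialize and register pattern described above can be sketched 
in plain C++. This is an illustration under stated assumptions, not the 
µProfiler interface: the class MonitorSketch, its member names, the registry 
and dispatchRoutineEntry are all hypothetical stand-ins, since the text names 
only uExecMonitor and Initialize and does not give their signatures. 

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an execution-monitor base class; the member names
// below are illustrative and are not the uExecMonitor interface.
class MonitorSketch {
public:
    virtual ~MonitorSketch() {
        // Unregister so the profiler never calls a destroyed monitor.
        auto& r = registry();
        r.erase(std::remove(r.begin(), r.end(), this), r.end());
    }
    virtual void routineEntry(const std::string& /*routine*/) {}
    virtual void routineExit(const std::string& /*routine*/) {}
    virtual void poll() {}

    // The profiler's list of all registered execution monitors.
    static std::vector<MonitorSketch*>& registry() {
        static std::vector<MonitorSketch*> monitors;
        return monitors;
    }

protected:
    // Derived constructors call this to register with the profiler.
    void initialize() { registry().push_back(this); }
};

// A new exact metric: count calls per routine.
class CallCountMonitor : public MonitorSketch {
public:
    CallCountMonitor() { initialize(); }
    void routineEntry(const std::string& routine) override {
        ++counts_[routine];
    }
    int count(const std::string& routine) const {
        auto it = counts_.find(routine);
        return it == counts_.end() ? 0 : it->second;
    }
private:
    std::map<std::string, int> counts_;
};

// What a profiler could do when a routine-entry hook fires.
inline void dispatchRoutineEntry(const std::string& routine) {
    for (MonitorSketch* m : MonitorSketch::registry()) m->routineEntry(routine);
}
```

A monitor that only needs entry counts overrides a single virtual member and 
inherits empty defaults for the rest, which is the sense in which only a 
subset of the base class’s routines has to be specialized. 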

Additional reuse is provided by inheriting from existing metrics that come 
with the µProfiler library or were previously built by the program analyst. 
All µProfiler metrics conform to the above mechanism. uSPMonitor, for 
instance, is an execution monitor that statistically samples a task and 
measures the time the task spends on a certain cluster in a certain routine 
in a particular state. Because uSPMonitor is based on statistical sampling, 
it inherits from uExecMonitor and specializes the poll routine, in which the 
data collection is performed. 
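
A sampling monitor of this kind can be sketched as follows. The sketch is a 
simplification and an assumption throughout: SamplingMonitorSketch and 
TaskState are hypothetical names, the poll signature is invented for 
illustration, and only the task state is recorded, whereas uSPMonitor also 
distinguishes cluster and routine. 

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum class TaskState : std::size_t { Running = 0, Ready = 1, Blocked = 2 };

// Hypothetical sampling monitor: each poll() records the state in which the
// sampled task was observed. Not the uSPMonitor implementation.
class SamplingMonitorSketch {
public:
    void poll(TaskState observed) {
        ++samples_[static_cast<std::size_t>(observed)];
        ++total_;
    }

    // Estimated share of time spent in `state`, as a percentage of samples.
    double percent(TaskState state) const {
        if (total_ == 0) return 0.0;
        return 100.0 * samples_[static_cast<std::size_t>(state)] / total_;
    }

    std::size_t total() const { return total_; }

private:
    std::array<std::size_t, 3> samples_{};  // one counter per task state
    std::size_t total_ = 0;                 // number of samples taken
};
```

The estimate improves with the number of samples, so the sampling frequency 
trades accuracy against interference with the profiled program. 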

Through this mechanism, an analyst can efficiently extend the functionality 
of µProfiler with metrics that fit the analyzed problem much better than the 
general purpose metrics created by the developers, resulting in profiling 
results that directly correspond to the problem under investigation. This 
approach enables an analyst to extend µProfiler’s functionality with any 
metric, analysis or visualization device using exact or statistical 
monitoring. It is more important to integrate the basic functionality for 
different execution monitoring, analyzing and visualizing methodologies on 
which both general and problem-specific modules can operate, than to build a 
fixed set of highly sophisticated metrics. 

Fig. 3. µProfiler performance metric 

Concurrency is also part of some object-oriented languages, e.g., µC++. 
µProfiler can identify corresponding objects (both caller and callee side) 
when a monitored task invokes an object’s member routines. While there are no 
predefined metrics using this feature, we anticipate them soon. 

5 Conclusion 

Concurrent systems have complex dynamic behaviour with significant implicit 
information embedded in the runtime environment. Our claim is that 
target-specific profilers can do a better job extracting and displaying 
information from this environment. We show that the tight integration 
possible with a target-specific profiler, i.e., between µProfiler and µC++, 
results in better information gathering at lower cost, and the ability to 
easily add new metrics through a single programming language. The µProfiler 
displays are simple but informative, requiring the analyst to manually locate 
performance issues, e.g., hot spots, by examining the data. We have found 
manual determination to be straightforward, and have discovered several 
performance problems using µProfiler while examining both µC++ and µC++ 
applications to understand their dynamic behaviour. 

Abstract. Traditionally, the use of multithreading capabilities of 
operating systems has been considered inadequate for implementing 
concurrent object-oriented languages because of their inefficiency and 
non-portability. However, current operating systems encourage programmers 
to use threads to manage concurrent activities, since they offer a number 
of advantages such as multiprocessing capabilities and thread 
communication through shared memory. To explore these issues, we have 
developed Lince, a multithreaded runtime system for concurrent objects. We 
describe Lince and its design philosophy and analyze its performance. 
The use of popular threads packages allows us to simplify system design 
and enhance portability. The overhead of using threads for implementing 
concurrent objects is negligible for medium and coarse grain applications, 
although it can be too expensive for those requiring many fine-grained 
objects. 



1 Introduction 

Concurrent object-oriented programming is a paradigm that tries to take ad- 
vantage of object orientation in the development of concurrent software. Under 
this model, parallel programs are collections of concurrent objects that interact 
by invoking the operations they define in their interfaces. One of the challenges 
of concurrent object-oriented computing is to implement this high-level
programming model efficiently and in a portable fashion. Here, portable means
not only that parallel programs can run on computers with different processors
and/or operating systems, but also on different parallel architectures.
Nowadays, the most popular parallel architectures are multiprocessors and networks
of workstations, so a concurrent object-oriented program should take advantage 
of these kinds of systems. 

Traditionally, one of the drawbacks of concurrent object-oriented languages
has been their inefficiency. However, recent work in this field reveals that
this assertion is not necessarily true. The design of efficient runtime
systems PJ, sometimes combined with aggressive compilation techniques [2], has
shown that concurrent object-oriented languages developed on top of such
systems can attain sequential efficiency. Our position is that current
operating systems offer a set of facilities that allow us to build efficient
and portable runtime systems for concurrent object-oriented languages. In
particular, we focus on multithreading. Modern operating systems encourage
programmers to use threads to manage concurrent activities because they offer
a number of advantages, such as computation and I/O overlapping,
multiprocessing capabilities, and thread communication through shared memory.
Our proposal is based on the application of distributed shared memory (DSM)
techniques for dealing with distribution issues nn and the use of threads for
implementing concurrent objects. The result is a runtime system which we call
Lince. In this paper we describe the Lince system and measure its performance
on several uniprocessor and multiprocessor systems. In particular, we analyze
the influence of multithreading on the cost of basic operations, such as
object creation and method invocation, in the case of local objects. Lince’s
performance in distributed systems is beyond the scope of this paper.

D. Caromel, R.R. Oldehoeft, and M. Tholburn (Eds.): ISCOPE’98, LNCS 1505, pp. 167-|17^ 1998.
© Springer-Verlag Berlin Heidelberg 1998

A.J. Nebro, E. Pimentel, and J.M. Troya

A key point of this work is that multithreading can lead to simpler and
more portable runtime systems by delegating some functions (e.g., object
scheduling) to the operating system, at the cost of an overhead that may be
acceptable for a wide range of applications. Although a number of thread APIs
exist, the increasing availability of Pthreads and the fact that other popular
thread packages (Solaris and Microsoft’s Win32 threads) share a common subset
of basic features 0 allow us to consider threads a viable choice. The basic
design objectives of Lince are to: a) provide a platform for writing
concurrent object-oriented programs that also offers services usable by
compilers; b) facilitate the portability of parallel applications to a broad
range of platforms; and c) achieve good performance.

The rest of the paper is organized as follows. In Sect. 2, we discuss the object 
model assumed in this work. The architecture of the Lince system is described 
in Sect. 3. The next section covers implementation details and the evaluation of 
the system. Related work is discussed in Sect. 5. Finally, Sect. 6 presents some 
conclusions and discussion of future work. 

2 Concurrent Objects in Lince 

Our work has to take place in the context of an object model that defines the 
structure of objects and their behavior. We have chosen a generic and simplified 
object model. We define a concurrent object as an entity that has an internal 
state composed of a set of hidden variables, and a public interface composed 
of a set of operations. State variables can only be accessed through operation 
invocation. An object’s activities are limited to invoking operations on other
objects, creating additional objects, and modifying its own internal state as
the result of an operation. Examples of languages that fit into this basic
object model are those based on the actor model 0, like ABCL 0 and HAL j^j.

The proposed object model extends the basic asynchronous operation
invocation scheme of the actor model by also allowing synchronous invocations.
We distinguish two kinds of operations: commands and queries. A command is an
operation that modifies the internal state of the invoked object without
returning information about it. A query is an operation that returns
information about the object without modifying its internal state. Commands
and queries behave differently: commands are asynchronous operations, while
queries are synchronous operations. Consequently, objects synchronize when a
query is invoked, according to a wait-by-necessity mechanism 0. This
concurrency model fits into Meyer’s proposal of integrating concurrency and
object-orientation m-

Fig. 1. Architecture of the implementation

The Lince system relies on the multithreading capabilities supplied by the
operating system to handle object management and communication in the case of
local objects. We believe, unlike some previous works HH, that current thread
packages can provide a portable and efficient abstraction layer. Thread
support in modern operating systems reduces the cost of multithreading while
offering additional advantages, such as exploiting multiprocessing
transparently. Features like memory sharing enable the use of aggressive
compiler techniques such as inlining or speculative optimizations.

The application of DSM techniques in Lince allows us to modify object
location dynamically by replicating or migrating objects. The Lince
implementation scheme adopts the entry memory consistency model [12], which
reduces communication latency while requiring fewer messages than other
consistency models. The drawback is that method invocations must be enclosed
between a pair of acquire and release operations on a lock that is associated
with the replicated object. Acquire operations can be exclusive or shared, so
objects are accessed according to a multiple-reader/single-writer scheme. To
avoid increasing the complexity of the programming model, this invocation
scheme is applied to all objects, replicated or not.
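
The acquire/release discipline described above can be illustrated with a minimal sketch. It is not the Lince API: the names ConcurrentInt, Mode, acquire, release and sum_of_parallel_reads are hypothetical, and the C++17 standard library lock std::shared_mutex stands in for whatever lock the runtime associates with an object. The sketch only shows how shared acquires admit several readers while an exclusive acquire admits a single writer.

```cpp
#include <atomic>
#include <cassert>
#include <shared_mutex>
#include <thread>
#include <vector>

// Illustrative names only; none of them are taken from the Lince API.
class ConcurrentInt {
public:
    enum class Mode { Shared, Exclusive };

    // Acquire the object's lock: many readers or a single writer.
    void acquire(Mode m) {
        if (m == Mode::Exclusive) lock_.lock();
        else lock_.lock_shared();
    }
    // Release must use the same mode as the matching acquire.
    void release(Mode m) {
        if (m == Mode::Exclusive) lock_.unlock();
        else lock_.unlock_shared();
    }

    void set(int v) { value_ = v; }     // command: modifies state, returns nothing
    int get() const { return value_; }  // query: returns state, modifies nothing

private:
    std::shared_mutex lock_;
    int value_ = 0;
};

// Each reader thread brackets its query with a shared acquire/release pair.
int sum_of_parallel_reads(ConcurrentInt& obj, int readers) {
    assert(readers >= 0);
    std::atomic<int> sum{0};
    std::vector<std::thread> pool;
    for (int i = 0; i < readers; ++i) {
        pool.emplace_back([&obj, &sum] {
            obj.acquire(ConcurrentInt::Mode::Shared);
            sum += obj.get();
            obj.release(ConcurrentInt::Mode::Shared);
        });
    }
    for (auto& t : pool) t.join();
    return sum.load();
}
```

A writer would bracket obj.set(...) with an exclusive acquire and release in the same way; applying the same bracket to every object, replicated or not, is what keeps the programming model uniform.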

3 Lince Multithreaded Architecture 

The architecture of the current implementation is shown in Fig. 1. In each
node there is a process that includes the runtime system and the concurrent
objects. Objects are implemented using one thread per object, and each thread
executes the same scheduling function. The runtime system is composed of the
object table, the communication agent (CA), and a set of structures such as
the node identifier and the host table. Each entry in the object table is a
pointer to an object handler, a data structure that holds object-related data
such as the thread identifier, the incoming message queue, a lock for ensuring
mutual exclusion, and a pointer to a reply box. The CA is an entity whose main
purpose is to receive messages from objects on other nodes and deliver them to
the target objects. Local objects communicate without using the CA, by putting
messages directly into the invoked object’s queue.

Table 1. Operating systems and hardware configurations

Computer          OS              Threads   Processor(s)    Clock   Memory
Sun Ultra 1       Solaris 2.5     Solaris   UltraSPARC I    143MHz  64MB
PC                NT 4.0 Workst.  Win32     Pentium MMX     200MHz  64MB
SGI O2K           IRIX 6.4        Pthreads  MIPS R10000     196MHz  4GB
AlphaServer 4100  DEC UNIX 4.0D   Pthreads  Alpha 21140-AA  300MHz  256MB

The Lince system provides a set of services that are used through object
handlers. These services are classified into object management (basic services
for object creation and operation invocation), object information (a set of
calls that report information about an object, such as its locality or its
state), and optimized services. The latter are optimized calls that replace
some of the basic services. For example, when only an isolated command or
query is going to be invoked, instead of issuing three operations (acquire +
command/query + release), the system provides services such as CommandAcquire
or QueryAcquire, which produce the same effect but require only two messages.

Although the optimized services improve the performance of the general
invocation scheme by requiring fewer messages, the communication costs are
still far from the objective of sequential efficiency when objects are local.
Nevertheless, the use of threads for implementing concurrent objects permits
techniques such as inlining. Thus, an operation invocation can be replaced by
a simpler piece of code, such as the body of the operation (eliminating the
function call and return) or even a direct access to the state variables of
the object.
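
As a rough illustration of the one-thread-per-object scheme and of the difference between commands and queries, the following sketch models an object handler with a message queue, a lock and a reply box. It is an assumption-laden reconstruction, not the Lince code: the class name ObjectHandler and its members are hypothetical, the object state is a single integer, and C++ standard library threads replace the Pthreads, Solaris and Win32 packages used by the prototype.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative sketch: ObjectHandler and its members are hypothetical names.
class ObjectHandler {
public:
    ObjectHandler() : worker_([this] { schedule(); }) {}
    ~ObjectHandler() {
        post([this] { done_ = true; });  // stop message, processed in order
        worker_.join();
    }

    // Command: asynchronous; the caller enqueues the message and continues.
    void command(std::function<void(int&)> op) {
        post([this, op] { op(state_); });
    }

    // Query: synchronous; the caller blocks on the reply box until served.
    int query(std::function<int(const int&)> op) {
        auto reply_box = std::make_shared<std::promise<int>>();
        std::future<int> reply = reply_box->get_future();
        post([this, op, reply_box] { reply_box->set_value(op(state_)); });
        return reply.get();
    }

private:
    // Local delivery: put the message directly into the object's queue.
    void post(std::function<void()> msg) {
        {
            std::lock_guard<std::mutex> guard(lock_);
            queue_.push(std::move(msg));
        }
        ready_.notify_one();
    }

    // Scheduling function executed by the object's own thread.
    void schedule() {
        while (!done_) {
            std::function<void()> msg;
            {
                std::unique_lock<std::mutex> guard(lock_);
                ready_.wait(guard, [this] { return !queue_.empty(); });
                msg = std::move(queue_.front());
                queue_.pop();
            }
            msg();  // run the operation outside the queue lock
        }
    }

    int state_ = 0;
    bool done_ = false;  // only touched by the worker thread
    std::mutex lock_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> queue_;
    std::thread worker_;  // declared last so the other members exist first
};
```

Because the queue is served in FIFO order by a single thread, two commands followed by a query observe the effect of both commands; the caller of query waits only when it actually needs the result, which is the essence of wait-by-necessity.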

4 Implementation Details and Evaluation 

Our current implementation of the Lince system is a prototype written in C++
1E|. We have used three different thread packages: Pthreads PI, Solaris threads
PI and Win32 threads m-. At present, there are ports to the following
operating systems: Solaris 2.X, Digital Unix 4.0, IRIX 6.4, and Windows NT/95.
We evaluated the system on the platforms shown in Table 1.

4.1 Basic Services Evaluation 

In this section we show the costs of basic operations, such as object creation and 
operation invocation. 

Table 2 shows the cost of object creation in these systems. This process
includes obtaining a new object handler and invoking the CreateObject service.
The times obtained depend directly on the primitives provided for thread
creation, and are about three orders of magnitude worse than those obtained in
other works PP. These results indicate that the thread approach is not
suitable for intensive fine-grained object-oriented computing, but they could
be considered acceptable if most of the fine-grained objects in the
application are long-lived. We also show the cost of invoking a query
operation on a local concurrent integer object using the acquire-query-release
scheme and the optimized service QueryAcquire, and the cost of invoking a
query when the object is already acquired. These results indicate that in the
worst case the cost of a local invocation ranges from tens to a few hundred
microseconds, depending on the platform.

Table 2. Costs of basic operations (in microseconds)

                           AlphaServer 4100  Origin-2000  Ultra 1    PC
Object creation                         970         1600     4300  1600
Acquire + Query + Release                52           42      234   130
QueryAcquire                             46           30      160    80
Query (already acquired)                0.8            1        1    10



4.2 Analysis of a Matrix Multiply Program 

We have coded a parallel matrix multiply program to measure the impact of
increasing the number of objects while decreasing the grain of the
computations they have to perform. The parallel algorithm is based on dividing
the result matrix into submatrices that can be computed in parallel by
multiplier objects. For the sake of simplicity, the program multiplies square
matrices of floats, and the result matrix is divided into 4^N submatrices.
Thus, by varying N from 0 to 4, the same computation is carried out by a
single process that contains between 1 and 256 multiplier objects. As the
matrix objects to be multiplied receive mostly read operations, the methods
for accessing matrix elements are inlined. Thus, multiplier objects can
directly access the internal state of the matrix objects.
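
The decomposition can be sketched as follows. This is only an illustration of the scheme described in the text, not the benchmark program: the function name block_multiply and the row-major std::vector representation are assumptions, plain standard library threads play the role of multiplier objects, and the matrix size is assumed to be divisible by 2^N.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using Matrix = std::vector<float>;  // row-major, size x size

// Multiply a and b, splitting the result into 4^n square submatrices,
// each computed by its own "multiplier" thread.
Matrix block_multiply(const Matrix& a, const Matrix& b,
                      std::size_t size, unsigned n) {
    const std::size_t blocks = std::size_t{1} << n;  // 2^n blocks per dimension
    assert(size % blocks == 0);
    const std::size_t edge = size / blocks;
    Matrix c(size * size, 0.0f);

    std::vector<std::thread> multipliers;
    for (std::size_t bi = 0; bi < blocks; ++bi) {
        for (std::size_t bj = 0; bj < blocks; ++bj) {
            // Each thread writes a disjoint block of c and only reads a and b,
            // which mirrors the inlined read access to the matrix objects.
            multipliers.emplace_back([&a, &b, &c, size, edge, bi, bj] {
                for (std::size_t i = bi * edge; i < (bi + 1) * edge; ++i) {
                    for (std::size_t j = bj * edge; j < (bj + 1) * edge; ++j) {
                        float sum = 0.0f;
                        for (std::size_t k = 0; k < size; ++k)
                            sum += a[i * size + k] * b[k * size + j];
                        c[i * size + j] = sum;
                    }
                }
            });
        }
    }
    for (auto& t : multipliers) t.join();  // simple termination detection
    return c;
}
```

With n = 0 a single multiplier computes the whole result, and with n = 4 the work is spread over 256 multipliers, so the total amount of arithmetic stays constant while the grain shrinks.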

Baseline results for the matrix multiply program, multiplying two 256 x 256
and two 1024 x 1024 matrices on one processor, appear in Table 3. The times
include the creation of the multiplier objects, operation invocations and
termination detection.

At first glance, the results of multiplying the 1024 x 1024 matrices are a
bit surprising: the times tend to decrease as the number of threads increases,
even though the number of computations remains constant in each execution and
the overhead of using threads must be added. It is difficult to explain this
behavior, but several issues must be considered. First, sequential matrix
multiplication is a processor-bound process, which has traditionally been
penalized by the scheduling policy of UNIX systems (and Windows NT).
Therefore, it is possible that the multithreaded version behaves better in
terms of scheduling. Second, taking into account that the executions take tens
of seconds, the cost of creating up to 256 objects is negligible. This is not
the case for the 256 x 256 matrix multiplication, where increasing the number
of threads slightly penalizes performance up to 16 multipliers. With a higher
number of multipliers, performance tends to degrade rapidly.

Table 3. Times obtained multiplying two 256 and two 1024 square float matrices using
one processor (in seconds)

                       256 X 256                        1024 X 1024
             AlphaServer  Origin  Ultra 1    PC   AlphaServer  Origin  Ultra 1    PC
Sequential          0.73    0.51     1.54  1.15          49.5    35.4    117.8  85.7
1 Mult.             0.78    0.34     1.32  1.12          59.5    25.7     98.2  81.4
4 Mults.            0.77    0.41     1.32  1.12          59.6    25.6     97.8  81.4
16 Mults.           0.78    0.55     1.41  1.13          59.5    23.6     97.0  81.4
64 Mults.           0.82    2.82     1.48  1.21          59.5    22.8     95.2  78.7
256 Mults.          1.02    1.14     2.16  1.51          59.6    23.0     90.5  73.5

We repeated the same experiments on the multiprocessor systems using all the
available processors (Table 4). Speedups were obtained by dividing the time of
the execution with one multiplier object by the time of each multithreaded
execution. We observed that on the AlphaServer the speedups were very similar
and close to linear for the 1024 x 1024 matrix multiplication, but they
decrease in the 256 x 256 experiments. In the case of the Origin-2000, the
results are not very good. One possible reason is the scheduling policy of the
IRIX operating system, which seems to create new threads on the same node as
the creating thread and then balance the load dynamically. As the grain of the
computations in the 256 x 256 matrix multiplication is small, most of the
computations will probably be executed using fewer processors than are
available.
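
The speedup figures can be reproduced from the reported times with a small helper. The function names speedup and efficiency are ours, and efficiency (speedup divided by the number of processors) is an extra quantity not reported in the paper; it is included only to make "close to linear" concrete.

```cpp
#include <cassert>

// Speedup relative to the run with one multiplier object.
double speedup(double time_one_multiplier, double time_parallel) {
    assert(time_parallel > 0.0);
    return time_one_multiplier / time_parallel;
}

// Parallel efficiency: speedup per processor (1.0 is ideal linear speedup).
double efficiency(double s, int processors) {
    assert(processors > 0);
    return s / processors;
}
```

For example, the AlphaServer 1024 x 1024 run with 4 multipliers gives 59.7 / 15.2, about 3.93, i.e. an efficiency of roughly 0.98 on 4 processors, whereas the best Origin-2000 run, 23.7 / 2.2, about 10.77, corresponds to an efficiency of roughly 0.67 on 16 processors.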



5 Related Work 

This section reviews previous works related to run-time support for implemen- 
ting concurrent objects. The Illinois Concert system combines aggressive com- 
piler and runtime techniques for implementing fine-grained concurrent object- 
oriented programs in sequential and parallel systems 0 . Concert has demonstra- 
ted sequential performance in several benchmarks HD- Our work differs mainly 
in that, to achieve portability and take advantage of multiprocessing, Lince is 
implemented using OS threads. 

StackThreads is a runtime system that has been used to implement ABCL,
one of the first concurrent object-oriented programming languages PUS]. Like
Concert, StackThreads aims at sequential efficiency, but it focuses on
runtime techniques rather than on compiler analysis and optimization. No
support for object migration or replication is provided, nor are they
considered in the ABCL language.

Several projects focus on portable runtime support for object-oriented
languages based on C++, like Mentat and CHARM++. Mentat is an object-oriented
parallel system based on a dataflow computation model. Its runtime system is
implemented using processes cni, so Mentat is mainly suitable for
medium-to-coarse-grain applications. CHARM++ m is a C++ extension that
classifies objects into sequential, concurrent, replicated, shared, and
communication objects. In Lince we adopted a more uniform object model, where
all the objects are concurrent, shared, and can be replicated.

Table 4. Times (in seconds) and speedups obtained multiplying two 256 and two 1024
square float matrices in the multiprocessor systems

                          256 X 256                            1024 X 1024
             AlphaServer 4100   Origin-2000        AlphaServer 4100   Origin-2000
             (4 processors)     (16 processors)    (4 processors)     (16 processors)
             Time   Speedup     Time   Speedup     Time   Speedup     Time   Speedup
1 Mult.      0.76   1           0.34   1           59.7   1           23.7   1
4 Mults.     0.20   3.8         0.27   1.25        15.2   3.93         7.4   3.20
16 Mults.    0.22   3.4         0.15   2.42        15.4   3.87         3.0   7.90
64 Mults.    0.24   3.2         0.16   2.12        15.2   3.93         2.2   10.77
256 Mults.   0.47   1.5         0.59   0.57        15.2   3.93         2.4   9.87



6 Conclusions and Future Work 



We have described Lince, a multithreaded runtime system for implementing 
concurrent object-oriented languages. Implementing concurrent objects using 
threads provided by modern operating systems allows simplifying system design 
and taking advantage of multiprocessing capabilities. Portability is enhanced 
because most thread packages offer a common subset of features. The system 
is based on an object model that is suitable for replicating objects, so the pro- 
gramming model imposes several requirements, such as explicitly acquiring and 
releasing objects, and the distinction between command and query operations. 

Lince performance measurements show that creating objects is expensive,
while the overhead of method invocation ranges from tens to a few hundred
microseconds in the worst case. Results of a parallel matrix multiply program
reveal that increasing the number of threads while decreasing granularity can
lead to better performance, contrary to expectations.

We conclude that the overhead of using threads for implementing concurrent
objects can be acceptable for medium- and coarse-grain applications, although
it is too expensive for those requiring many fine-grained objects. Tuning the
Lince runtime system for better performance and building representative
application workloads are topics of future work.

Abstract. We present a C++ template run-time library, Promoter,
and discuss run-time support for data-parallel applications. The Promoter
run-time library provides a uniform framework for data-parallel
applications, covering a broad spectrum of granularity, regularity and 
dynamicity. It supports user-defined data structures ranging from dense 
to sparse arrays, regular to irregular index structures and data distribu- 
tions. The object-oriented design and implementation of the Promoter 
run-time library not only provides an easy data-parallel programming en- 
vironment, but also leads to an efficient implementation of data-parallel 
applications through object reuse and object specialization. 



1 Introduction 

A frequently used model for developing parallel applications on distributed- 
memory multiprocessors is the data-parallel programming model in which paral- 
lelism is achieved by partitioning large data sets between processors, and having 
each processor work only on its local data. 

In this paper we discuss the run-time support for data-parallel applications
provided by the Promoter run-time library (Prl). Following object-oriented
design principles, the Prl provides a uniform interface to support
data-parallel applications on both dense and sparse arrays or data structures.
By sparse arrays we mean not only regular ones but also irregular ones. A
regular sparse array may have an index set with a regular scheme, like a band
matrix, while in an irregular sparse array indices are generated irregularly
at run-time. A uniform interface is necessary because in many scientific
applications operations on dense and sparse arrays coexist.

In the following, we first present a run-time model for data distribution in 
which data distribution descriptors as run-time support are introduced. Then we 
discuss how to use this run-time support in computation and communication. 
Finally, we analyze some performance results, compare our approach with related 
works, and give the concluding remarks. 

2 Data Distribution 

Our approach assumes that a set of (virtual) Spmd processes runs in parallel on
a distributed-memory multiprocessor. Each process has at least one control flow
(possibly multi-threaded) and its own address space. In our terms, the address
space of a process is called a domain. Each process can only access elements in
its own domain, and can communicate with other processes by receiving and
sending elements from and to other domains.

* This work is supported by the Real World Computing Partnership, Japan.

2.1 Data Distribution Descriptor 

Most scientific applications use dense and/or sparse arrays for modeling
their data structures. Arrays are always indexed by so-called subscripts. The
space containing all subscripts is called the index space of that array. An
index space describes the spatial structure of an array.

In a parallel implementation, because of data distribution (the partitioning 
of data between processors) there are two kinds of index spaces. A global index 
space with global subscripts describes the spatial structure of an array in a 
sequential context, while a local index space with local subscripts describes the 
spatial structure of an array in a parallel context (that is, after applying data 
distribution). In our work, the global index space should be provided by users, 
because it is problem-oriented. The local index space and its relation to the global 
index space are maintained by the Prl in so-called data distribution descriptors. 
They are defined as Distribution classes with the following structure: 

— a map method that returns a domain number for any global subscript 

— a transform method that returns a local subscript for any global subscript 
belonging to the local domain 

— an Iterator class to iterate over all indices that belong to the local domain, 
and from which the global and local subscripts of the current index can be 
retrieved 

— an Allocator super-class that provides information for the allocation of local 
memory. 

A map method is normally implemented by an arithmetic function. In this 
way, the Prl does not restrict mapping strategies to some pre-defined ones such 
as block-cyclic mapping. 

The transform method and the Iterator class can be implemented simply for
(regular) dense arrays. For (irregular) sparse arrays, in the worst case they
have to be implemented by searching a table. In our approach, we only require
local information from the transform method and the Iterator class. In other
words, the data distribution descriptor itself is distributed. In this way,
the worst case involves a search restricted to the local index space, so both
memory size and search time scale.

An Allocator class determines how the local data set should be allocated. Prl 
provides different Allocator classes to allocate local data in the form of vector, 
matrix, binary tree or hash table. 
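
To make the structure of a data distribution descriptor more concrete, the sketch below shows a hypothetical descriptor for a one-dimensional dense array split into contiguous blocks. Only the member names map, transform and Iterator come from the text; the class name BlockDistribution, the constructor parameters and local_size (standing in for the information an Allocator would need) are assumptions and do not reproduce the Prl type interface.

```cpp
#include <cassert>

// Hypothetical descriptor for a 1-D dense array split into contiguous blocks.
class BlockDistribution {
public:
    BlockDistribution(int global_size, int domains, int my_domain)
        : size_(global_size),
          block_((global_size + domains - 1) / domains),  // ceiling division
          me_(my_domain) {
        assert(global_size > 0 && domains > 0);
        assert(my_domain >= 0 && my_domain < domains);
    }

    // Domain that owns a global subscript (a purely arithmetic function).
    int map(int global) const { return global / block_; }

    // Local subscript of a global subscript owned by the local domain.
    int transform(int global) const {
        assert(map(global) == me_);
        return global - me_ * block_;
    }

    // Number of elements to allocate locally.
    int local_size() const {
        int first = me_ * block_;
        int last = first + block_ < size_ ? first + block_ : size_;
        return last > first ? last - first : 0;
    }

    // Iterates over all indices that belong to the local domain.
    class Iterator {
    public:
        explicit Iterator(const BlockDistribution& d) : dist_(d), local_(0) {}
        bool valid() const { return local_ < dist_.local_size(); }
        void next() { ++local_; }
        int local() const { return local_; }
        int global() const { return dist_.me_ * dist_.block_ + local_; }
    private:
        const BlockDistribution& dist_;
        int local_;
    };

private:
    int size_, block_, me_;
};
```

For a dense block distribution every operation is simple arithmetic; an irregular sparse descriptor would keep the same interface but answer transform and the iterator from a table restricted to the local index space.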

The current Prl provides a broad spectrum of Distribution classes for dense
arrays and sparse arrays with different mapping strategies (see Fig. 1).

Fig. 1. Distribution Classes

Users or compilers can select a suitable distribution class according to the
problem to be solved. One of the most distinctive features of the Prl is that
all data distribution descriptors in it have a uniform type interface. The Prl
is realized by class and function templates that are parameterized over data
types such as Distribution. Different data distribution descriptors, reflecting
the different natures of arrays and their mapping strategies, can be
implemented and then used in the Prl. They only have to follow the type
interface defined by the Prl. Users or compilers are therefore free to provide
their own implementations of Distribution and other classes.

The Distribution classes can be implemented as STL adaptors. The STL
container and iterator classes cannot be used directly as Distribution classes
because data distribution must be dealt with explicitly.

2.2 Distributed Object 

A distributed object is a collection of data elements which are partitioned and
then allocated in the local memories of a distributed-memory multiprocessor.

template<class DT, class ET> class Disobj {
public:
    Disobj(const DT& t);
    ~Disobj();
    ET& operator[](const int index) const;
};

A distributed object is defined as a class template over a Distribution class 
DT and an element type ET, with the meaning that the data elements are of 
the type ET and are partitioned and allocated according to the Distribution 
class DT. The operator [ ] provides access to local data elements through a local 
subscript. 
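
A minimal sketch of such a class template, with the local elements carried by a std::vector, might look as follows. The stand-in descriptor FixedDistribution, its local_size member and the distribution() accessor are assumptions made for the example. Unlike the declaration above, the sketch provides separate const and non-const subscript operators, since returning a mutable reference from a const member would not compile with a plain vector member.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal stand-in for a Distribution class; only local_size() is needed here.
struct FixedDistribution {
    std::size_t elements;
    std::size_t local_size() const { return elements; }
};

// Sketch of a distributed object whose local part is stored in a vector.
template <class DT, class ET>
class Disobj {
public:
    explicit Disobj(const DT& t) : dist_(t), data_(t.local_size()) {}

    // Access a local element through its local subscript.
    ET& operator[](int index) {
        assert(index >= 0 && static_cast<std::size_t>(index) < data_.size());
        return data_[static_cast<std::size_t>(index)];
    }
    const ET& operator[](int index) const {
        assert(index >= 0 && static_cast<std::size_t>(index) < data_.size());
        return data_[static_cast<std::size_t>(index)];
    }

    const DT& distribution() const { return dist_; }

private:
    const DT& dist_;        // descriptor is referenced, not copied
    std::vector<ET> data_;  // carrier for the local elements
};
```

Holding the descriptor by reference lets several distributed objects share one descriptor, provided the descriptor outlives them.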

With a uniform interface to data distribution descriptors, Disobj classes can
be implemented as generic class templates. In fact, the implementations of
Disobj classes differ only in the Allocator class used. That is, the Prl
provides a corresponding Disobj class template for each Allocator class.

Data distribution is supported by providing descriptors and carriers
separately. This decomposition has two benefits: descriptors can be
implemented differently (that is, specially optimized) for different kinds of
arrays and their mapping strategies, and one descriptor may be shared by
several Disobj objects to increase efficiency.

3 Parallel Computation 

Parallel computation can be divided into two subtasks: work distribution and 
local operation. Work distribution means to divide operations that can be exe- 
cuted in parallel across processors. Such an operation may be initially described 
by a loop over global subscripts in a global index space. Work distribution is 
performed in two steps. In the first step, the loop is divided into a set of local 
loops with respect to the local domain. In the second step, all global subscripts 
involved are transformed to local subscripts. 

A loop over a global index space can be transformed to a loop over a local 
index space by calling the iterator of the Distribution class of a target Disobj 
object. At each iteration, the current subscripts (both global and local) are 
obtained. The global subscript is then used in a test within the loop to filter 
unnecessary operations as defined by the original loop, and the local subscript 
is used to access the target data element. For all source Disobj objects, each 
global subscript is computed as a function of the current global subscript of the 
target element. Then the corresponding local subscript is found by calling the 
transform method in a Distribution class. Non-distributed objects can also be 
used in parallel computation, as they are duplicated at each domain. 
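
The two steps of work distribution can be illustrated on a simple element-wise loop. The sketch below is a single-process simulation under stated assumptions: Block is a hypothetical one-dimensional block descriptor, scale_local is the local loop executed by one domain, and run_all_domains merely emulates the domains one after another so that the result can be compared with the sequential loop. Source and target share the same distribution, so no communication is needed.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 1-D block descriptor, kept minimal for this example.
struct Block {
    int block;  // block length
    int me;     // local domain number
    int size;   // global size
    int transform(int g) const { return g - me * block; }
    int global(int l) const { return me * block + l; }
    int local_size() const {
        int first = me * block;
        int last = first + block < size ? first + block : size;
        return last > first ? last - first : 0;
    }
};

// Original sequential loop:  for (g = lo; g < hi; ++g) y[g] = 2 * x[g];
// Local version executed by one domain on its local parts of x and y.
void scale_local(const Block& d, const std::vector<int>& x_local,
                 std::vector<int>& y_local, int lo, int hi) {
    for (int l = 0; l < d.local_size(); ++l) {  // iterate over the local domain
        int g = d.global(l);                    // current global subscript
        if (g < lo || g >= hi) continue;        // filter by the original bounds
        int src = d.transform(g);               // source subscript made local
        y_local[l] = 2 * x_local[src];
    }
}

// Emulates all domains in turn and reassembles the global result.
std::vector<int> run_all_domains(const std::vector<int>& x, int domains,
                                 int lo, int hi) {
    int n = static_cast<int>(x.size());
    int block = (n + domains - 1) / domains;
    std::vector<int> y(n, 0);
    for (int p = 0; p < domains; ++p) {
        Block d{block, p, n};
        std::vector<int> xl(d.local_size()), yl(d.local_size(), 0);
        for (int l = 0; l < d.local_size(); ++l) xl[l] = x[d.global(l)];
        scale_local(d, xl, yl, lo, hi);
        for (int l = 0; l < d.local_size(); ++l) y[d.global(l)] = yl[l];
    }
    return y;
}
```

The filter on the global subscript reproduces the bounds of the original loop, while all element accesses use local subscripts only.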

The above implementation can be further improved by the following two
optimizations. First, since the work distribution described above depends only
on data distribution descriptors, it is possible to collect all local
subscripts for an operation at the beginning and then reuse them multiple
times, e.g., within a loop. Second, because the collection of local subscripts
uses only an Iterator class from a Distribution class, it is possible to
define special Iterator classes that iterate over only a subset of the local
indices of a Distribution class, for example, a row, a column, or a diagonal
of a matrix.

4 Communication 

Communication in data-parallel applications can be divided into two kinds:
point-to-point and collective. Based on the run-time support provided by a
Distribution class, point-to-point communication can easily be implemented.
Each process checks whether it is the owner of a source element or a target
element by calling the map method. If a process is the owner of a source or
target element, a local subscript for that element is found by calling the
transform method, and then send or receive routines are called to perform the
communication.

The collective communication is performed in the so-called inspector/executor
paradigm [1]. First, an inspector routine is called to build a communication
schedule that describes the required data motion, and then an executor routine
is called to perform the data motion (sends and receives). For dense arrays,
the communication scheduling phase must be performed at run-time if some
parameters (rank of arrays, loop parameters, or the number of available
processors) or data accesses are unknown at compile-time. For sparse arrays,
even the spatial structure is unknown at compile-time, so communication
scheduling must always be performed at run-time.



4.1 Communication Scheduling 



The Prl provides a set of generic communication scheduling routines through 
function templates over Distribution classes. We define communication schedu- 
ling to be a transformation from a communication pattern to a communication 
schedule, with respect to the corresponding Distribution classes. Communica- 
tion patterns describe the required data motion by global subscripts, while the 
communication schedules describe the required data motion by local subscripts. 
Expression of communication patterns is simple, because it does not involve 
data distribution details. Therefore, we require users to express communication
patterns in a problem-oriented way, and the Prl provides run-time support to
generate the corresponding communication schedules. In this way, communication
scheduling encapsulates the data distribution details related to
communication.

Different communication scheduling routines are provided for different
communication patterns: one-to-one, one-to-many (e.g. gather or reduction),
and many-to-one (e.g. scatter or expansion). For each pattern, there are three
possible representations: enumerated, functional, and dimensional.



// Communication scheduling
// cp is a one-to-many communication pattern in a functional
// representation
// cs is the communication schedule for the communication from
// y to y defined by cp
Sparse_Array x;
Mapping CY_BLOCK;
Distribution y(x, CY_BLOCK);
Comm_Pattern cp = {<i,j> => <i,j+1>,<i,j-1>,<i-1,j>,<i+1,j>};
Comm_Schedule cs = communication_scheduling(y, y, cp);



The communication scheduling can be implemented in parallel by using the 
run-time support provided by data distribution descriptors. There are two kinds 
of implementation schemes: sender- or receiver-initiated. At the local domain, 
sender/receiver-initiated communication scheduling collects all local subscripts 
to be sent/received to/from other domains, and all global subscripts to be recei- 
ved/sent from/to other domains. It then transfers the collected global subscripts 
to the corresponding receivers/senders, and finally transforms the received global 
subscripts to local subscripts. 
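
The translation from global to local subscripts can be illustrated with a small serial sketch. It makes strong simplifying assumptions: every process sees the whole pattern (so the exchange of global subscripts between processes is omitted), the distribution is a 1-D block layout, and all type and function names (`Block1D`, `Entry`, `Schedule`, `inspect`) are hypothetical rather than part of the Prl.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed 1-D block distribution (not the Prl's Distribution class).
struct Block1D {
    std::size_t n, p;
    std::size_t block() const { return (n + p - 1) / p; }
    std::size_t owner(std::size_t g) const { return g / block(); }
    std::size_t local(std::size_t g) const { return g % block(); }
};

// One schedule entry: a local subscript and the partner process.
struct Entry {
    std::size_t local;
    std::size_t partner;
};

struct Schedule {
    std::vector<Entry> sends;     // local source elements to send
    std::vector<Entry> receives;  // local target elements to fill
};

using Pattern = std::vector<std::pair<std::size_t, std::size_t>>;

// Inspector: translate (source, target) pairs given by global subscripts into
// the schedule seen by process `rank`. Purely local pairs need no message.
Schedule inspect(const Block1D& d, std::size_t rank, const Pattern& pattern) {
    assert(rank < d.p);
    Schedule s;
    for (std::size_t k = 0; k < pattern.size(); ++k) {
        const std::size_t src = pattern[k].first, tgt = pattern[k].second;
        const std::size_t so = d.owner(src), to = d.owner(tgt);
        if (so == to) continue;
        if (so == rank) s.sends.push_back({d.local(src), to});
        if (to == rank) s.receives.push_back({d.local(tgt), so});
    }
    return s;
}
```

The resulting schedule contains only local subscripts and partner ranks, which is what makes it independent of the data stored in the distributed objects.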

The above generic implementation is scalable if a good mapping strategy
exploits locality. Moreover, the results of communication scheduling can
be reused (e.g. within a loop), because communication scheduling depends
only on data distribution descriptors, not on distributed objects. Many
scientific and engineering problems are solved by iterative methods, in
which communication schedules can be reused naturally.



4.2 Data Motion 

In the Prl, three kinds of data motion routines are provided, for one-to-one,
one-to-many, and many-to-one communication patterns (schedules). They are
generic with respect to Disobj classes and a communication schedule. In these
routines, communication can be overlapped with computation by providing a
user-definable function object. The Prl supports the overlapping of
communication and computation by the owner-computation rule. A user-definable
function object defines, in a flexible way, the operation applied between the
target data elements and the received source data elements after
communication. For example, we can define = or += for one-to-many
communication, and +=, *=, max or min for many-to-one communication.

// Overlapped communication and computation by reuse of
// communication scheduling
Disobj<Distribution, double> tar(y), src(y);
ADD_OP fun; // function object for reduction
for (int i=1; i<No_Iter; i++)
    data_motion(tar, src, cs, fun);
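
How a function object plugs into a generic executor can be sketched in a single address space. This is an assumption-laden illustration, not the Prl routine: the `Schedule` alias, `AddOp`, and this `data_motion` signature are hypothetical, and no messages are actually exchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A schedule reduced to (source index, target index) pairs; in a distributed
// run these would be local subscripts plus partner processes.
using Schedule = std::vector<std::pair<std::size_t, std::size_t>>;

// Function object combining a received source value into a target element.
struct AddOp {
    void operator()(double& target, double received) const { target += received; }
};

// Generic executor: the combining operation is a template parameter, so the
// call is resolved at compile time rather than through a virtual function.
template <class Op>
void data_motion(std::vector<double>& tar, const std::vector<double>& src,
                 const Schedule& cs, Op op) {
    for (const auto& e : cs) {
        assert(e.first < src.size() && e.second < tar.size());
        op(tar[e.second], src[e.first]);
    }
}
```

Because the schedule is built once and passed by reference, calling `data_motion` repeatedly inside an iteration loop reuses it at no extra scheduling cost.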

The data motion routines are provided as function templates that are generic
with respect to communication schedules. Special or optimized communication
scheduling can be implemented and its results used by these data motion
routines, provided the resulting communication schedules conform to the
uniform type interface defined by the Prl.

The data motion routines encapsulate message passing details such as creation
and management of communication buffers, communication and synchronization,
and data packing and unpacking. They are implemented on top of an underlying
message-passing library, which provides asynchronous send and receive
operations, synchronization operations, and buffer management. Currently, the
Prl is implemented on top of MPI, PVM, and some native communication
packages.

5 Performance 

Our library was developed in the PROMOTER [2] project and is used both as a
user-level library and as a run-time library for the Promoter compiler. The
system was developed on our testbed system Manna, and has been ported to the
IBM SP/2, Cray T3E, Hitachi SR2201, and a cluster of workstations using
either Ethernet or Myrinet communication hardware.

Within the project, many applications (finite element methods, heat
conduction, computational fluid dynamics, elasticity) have been tested. We
have also implemented some benchmark programs using the run-time library, and
have compared the performance of our implementation with hand-written and
optimized implementations based on MPI.

Table 1 shows the run times of the CG and FFT programs of the NASPAR
benchmarks on the Hitachi SR2201 machine. The CG program has sparse
matrices; the FFT program has only dense arrays.



Table 1. Wall-clock run time of NASPAR benchmarks (seconds) 



Benchmarks   8 PEs          16 PEs         32 PEs        64 PEs
CG (A/B)     28.04/814.54   16.83/437.03   9.90/223.96   -
FFT (A/B)    -              35.88/79.50    18.02/39.23   9.399/20.69



This performance is achieved by specializing and reusing communication
schedules in the programs that use the Prl. Our approach achieves
approximately 90%-100% of the performance of the hand-written benchmarks
that directly use MPI.

6 Related Work 

Much work has been carried out on run-time support for data-parallel
applications. One of the pioneering efforts is the development of a series of
run-time libraries: Multiblock Parti, Chaos, and Chaos++. They have three
different interfaces for dense arrays and sparse arrays with regular and
irregular data distributions, and they use virtual functions in the
implementation. The Prl has a uniform interface and uses only class templates
and function templates, so that their invocations can be resolved at
compile-time.

There are many parallel C++ efforts, such as ICC++ [3], C**, pC++ [5],
MPC++, and HPC++, which must also deal with run-time support for
data-parallel applications. Usually, only regular dense arrays are considered.
In the Illinois Concert System, ICC++ expresses (irregular) data parallelism
as task-level concurrency. The Prl not only expresses regular and irregular
data parallelism, but also expresses task-level concurrency as data
parallelism. For example, a tree or a graph can be expressed by a sparse array
by assigning a unique subscript to each node. This results in easy generation
of collective communication, which can be achieved in the Illinois Concert
System only by comprehensive compile-time analysis.

Another similar work is POOMA. The POOMA framework is constructed in a
layered fashion, in order to exploit efficient implementations at the lower
levels while preserving an interface germane to the application problem
domains at the highest level. However, POOMA is not a general-purpose
programming environment, because it is motivated by specific applications.
The Prl is designed to be a general-purpose user-level library for
data-parallel applications, and also a run-time library to support the
Promoter compilation system for a general-purpose data-parallel language.

7 Conclusions 

The Prl provides object-oriented run-time support for regular dense and
irregular sparse structures for easy and efficient data-parallel programming,
with the following features. Data distribution details are decoupled from data
by introducing data distribution descriptors. Data distribution descriptors
themselves are distributed for scalability. They have a uniform interface for
both dense and sparse arrays, allowing a generic implementation of
distributed data.

With the help of these descriptors, operations on distributed data can be
easily mapped from a specification through global subscripts (sequential
execution) to a specification through local subscripts (SPMD execution) by
work distribution and communication scheduling.

Descriptors, work distribution and communication scheduling are provided 
in generic form for fast prototyping, and can be optimized or specialized for an 
efficient implementation. Descriptors can be shared, and results of work distri- 
bution and communication scheduling can be reused at run time for efficiency. 


Abstract. Los Alamos National Laboratory’s Tecolote Framework is
used in conjunction with other libraries by several physical simulations. 
This paper briefly describes the design and use of Tecolote’s component 
architecture. A component is a C++ class that meets several require- 
ments imposed by the framework to increase its reusability, configurabi- 
lity, and ease of replacement. We discuss both the motives for imposing 
these requirements upon components and the means by which a generic 
C++ class may be integrated into Tecolote by satisfying these require- 
ments. We also describe the means by which these components may be 
combined into a physics application. 



1 Introduction 

Los Alamos National Laboratory’s Blanca project is part of the Department 
of Energy’s Accelerated Strategic Computing Initiative (ASCI), which focu- 
ses on science-based nuclear weapons stockpile stewardship through the large- 
scale simulation of multi-physics, multi-dimensional, stockpile-relevant problems. 
Blanca is the only Los Alamos ASCI project written entirely in C++. Tecolote,
the underlying framework for the development of Blanca physics codes,
provides an infrastructure for combining individual component modules to
create large-scale applications that encompass a wide variety of physics
models, numerical solution options, and underlying data storage schemes,
activating only essential components at run-time. Tecolote maximizes code
re-use and separates physics from computer science as much as possible. This
allows physics model developers to use the Parallel Object-Oriented Methods
and Applications (POOMA) framework, upon which Tecolote is layered, to write
algorithms in a style similar to the problem’s underlying computational
physics equations [2].

POOMA contains architecture and parallelism abstractions that allow the
user to write parallel physics codes without worrying about the underlying
architecture or communications libraries. POOMA provides C++ fields that are
similar to Fortran-90 arrays, but have additional features, including domain
decomposition, load balancing, communications, and compact data storage.
POOMA’s unique capabilities provide the methods developer with powerful tools
for expressing various mesh types and multiple dimensions; this allows
application developers to write mesh- and dimension-independent physics code
whenever possible. Combined with Tecolote, POOMA’s flexibility allows us to
keep pace with the ever-changing ASCI environment, rapidly prototype ideas,
and build on what others have done rather than using valuable time to
reimplement the same basic models on different architectures.

* This work was performed under the auspices of the U.S. Department of Energy by
Los Alamos National Laboratory under Contract No. W-7405-Eng-36.

Tecolote is portable to all ASCI-relevant hardware, making full use of its 
available parallelism. By supporting the rapid implementation of physics models 
and their immediate application to problems on the ASCI scale, Tecolote pro- 
vides a powerful and flexible run-time environment that allows users to create 
and compose physics codes with varying capabilities “on the fly.” 



2 Approach 

As we will discuss later, the Tecolote Framework supplies an application pro- 
grammer interface that supports factorization of applications into components. 
Factorizing an application enhances the programmer’s ability to cleanly separate 
interfaces from implementations, encapsulate conceptually independent subparts 
of a program, avoid code duplication, and maximize code reuse. After facto- 
rization, the user can integrate desired components using techniques supplied 
by Tecolote. Key concepts supported by this component architecture include 
separation of computer science from physics in simulations, implementation- 
independence of component interfaces, and increased run-time configurability 
(through an input-file scripting language). This flexible approach is made
possible through the use of C++ inheritance and virtual function polymorphism.
However, a Tecolote component’s increased modularity comes at a price:
creation of and communication between components can be more expensive than
the corresponding operations for ordinary C++ objects. Thus, the components’
granularity must be large enough that these operations do not impact the
code’s efficiency.

Tecolote facilitates factorization through several mechanisms: 

— uniform run-time data-sharing interface (the DataDirectory) with dynamic 
scoping rules; 

— facility for the run-time description of components’ type and inheritance 
relations (MetaType) and the registration of this information in a single type 
table (MetaSet); 

— means of configuring both data values and control flow without recompilation 
(the input file scripting language); and 

— separation of I/O and computation through the designation of “persistent” 
data intended for I/O. 

The remainder of this paper illustrates the above-described mechanisms and 
their interaction through the extended example of a gamma-law equation of state 
(EOS) model. 




Component Architecture of the Tecolote Framework 185 



3 Setting Component Parameters - Persistents 

In object-oriented programming, I/O methods are generally encapsulated in
application classes. To adopt a new I/O format, whether binary instead of
ASCII text or eight-digit instead of six-digit floating-point output, every
application class must be modified.

In Tecolote, we separate what is needed for I/O from how I/O is executed. 
The “what” is specified by the application programmer in a persistent list that 
contains the class data members available for I/O. Another component, an I/O 
module, determines how I/O is actually executed. The I/O module knows how to 
extract persistent locations from objects’ MetaTypes and how to perform some 
type of I/O operation. There may be several different I/O modules in a system, 
each corresponding to a different data format. 

For example, the GammaLaw class members are

REAL pmin; // minimum pressure
REAL gamma; // adiabatic gamma

The persistents are listed outside the class declaration:

template< class C >
BEGIN_PERSISTENT( GammaLaw< C > )
PERSISTENT( REAL, pmin, "pmin" )
PERSISTENT( REAL, gamma, "gamma" )
END_PERSISTENT

Persistents support factorization by separating I/O modules from applica- 
tion modules and by deferring decisions about data initialization until run-time. 
Factoring I/O out of application objects ensures consistent input and output 
formatting with little burden on the application programmer; it also localizes 
changes required to support new data formats. 
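
The separation of "what" from "how" can be sketched in standard C++ with a list of named member pointers. This is only one possible realization under stated assumptions: the `Persistent` struct, `gamma_law_persistents`, `write_text`, and `set_by_name` are hypothetical names and do not reproduce Tecolote's macros or I/O modules.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a component with two persistent members.
struct GammaLaw {
    double pmin = 0.0;   // minimum pressure
    double gamma = 0.0;  // adiabatic gamma
};

// One persistent entry: an external name plus a pointer to the member.
template <class C>
struct Persistent {
    std::string name;
    double C::*member;
};

// The "what": which members take part in I/O, declared outside the class.
inline std::vector<Persistent<GammaLaw>> gamma_law_persistents() {
    return {{"pmin", &GammaLaw::pmin}, {"gamma", &GammaLaw::gamma}};
}

// The "how": one possible text writer. Another I/O module could walk the
// same list and emit a different format without touching GammaLaw.
template <class C>
std::string write_text(const C& obj, const std::vector<Persistent<C>>& list) {
    std::ostringstream out;
    for (const auto& p : list) {
        assert(p.member != nullptr);
        out << p.name << " = " << obj.*(p.member) << "\n";
    }
    return out.str();
}

// A matching reader that sets a persistent by name; false if the name is unknown.
template <class C>
bool set_by_name(C& obj, const std::vector<Persistent<C>>& list,
                 const std::string& name, double value) {
    for (const auto& p : list) {
        if (p.name == name) {
            obj.*(p.member) = value;
            return true;
        }
    }
    return false;
}
```

Supporting a new data format then means adding another function like `write_text`, while the component class and its persistent list stay unchanged.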

4 Sharing Fields Between Components - DataDirectory 

Two models may use different (or the same) fields to compute their respective 
results. However, because we want to use virtual-function polymorphism to call 
the two models interchangeably, the models must use the same calling sequence 
in their respective evaluation functions. To avoid passing different fields in the 
argument lists of the evaluation functions, we have developed another alterna- 
tive: passing a single data structure to the EOS model constructor, which then 
holds all the fields needed for a material. This data structure is termed the 
DataDirectory. 

A DataDirectory is just like any other Tecolote component, except that it
may have any number of persistents with any names. In contrast, ordinary
components may only contain the persistents specified in their persistent
lists. Any object, including another DataDirectory, may be placed inside
a DataDirectory. Thus the DataDirectory structure is hierarchical, much like a
Unix directory structure. Two entries are automatically put in a newly created
DataDirectory to allow traversal of the hierarchy: “Root,” which points to the
DataDirectory at the top of the hierarchy, and “Parent,” which points to the
immediate predecessor in the hierarchy of the current DataDirectory.

[Figure: tree with nodes labeled Root, MaterialSet, MaterialSet, and Mesh]

Fig. 1. The DataDirectory hierarchy



Figure 1 shows the DataDirectory hierarchy used for a multi-material
simulation (for simplicity, only a few POOMA Fields are shown here). Unlike a
Unix file structure, the DataDirectory structure has scoping rules similar to
those of C++ inheritance. However, whereas C++ scope is determined by
compile-time class inheritance relations, DataDirectory scope is determined by
run-time object nesting. For instance, when GammaLaw attempts to get the
PhysicsMesh from the “Material” DataDirectory, it fails to find it. Therefore,
the search continues up the hierarchy, examining the “MaterialSet” and “Root”
directories and terminating after it finds the requested PhysicsMesh in the
“Root” directory.

The example below, from the GammaLaw class, illustrates the use of the 
DataDirectory macro GET: 
ScalarField<C>& IntEnergy(
    GET("IntEnergy", Mat, ScalarField<C>, (Mesh))
);

The first argument is the name of the requested item; the second is the 
DataDirectory in which the search starts (Mat[erial] is a DataDirectory); the 
third is the type of the DataDirectory item; and the fourth (if present) repre- 
sents the constructor arguments that are needed if the item is not present and 
must be added to the DataDirectory. The GET macro returns a reference to the 
object found in the DataDirectory. 
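
The scoped lookup and the find-or-create behavior of GET can be sketched with a small class. The sketch is deliberately simplified and hypothetical: `Directory` stores only `Field` values (a vector of doubles), whereas a DataDirectory holds objects of arbitrary type, and the member names `put`, `find`, and `get` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Field = std::vector<double>;  // stand-in for a POOMA field

// Hypothetical, much-simplified directory: named fields plus a parent link.
class Directory {
public:
    explicit Directory(Directory* parent = nullptr) : parent_(parent) {}

    void put(const std::string& name, Field f) { entries_[name] = std::move(f); }

    // Search this directory first, then its ancestors; nullptr if not found.
    Field* find(const std::string& name) {
        for (Directory* d = this; d != nullptr; d = d->parent_) {
            auto it = d->entries_.find(name);
            if (it != d->entries_.end()) return &it->second;
        }
        return nullptr;
    }

    // Analogue of the four-argument GET: return the entry if it is visible
    // from here; otherwise create it in this directory with `size` zeros.
    Field& get(const std::string& name, std::size_t size) {
        assert(!name.empty());
        if (Field* f = find(name)) return *f;
        return entries_[name] = Field(size, 0.0);
    }

private:
    Directory* parent_;
    std::map<std::string, Field> entries_;
};
```

A lookup started in a nested directory sees entries of all enclosing directories, while an entry created by `get` stays local to the directory where the search began.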

The DataDirectory improves code factoring because a single data structure 
is passed to methods that otherwise would use different calling sequences. By 
deferring the association of data with a particular model until its actual in- 
stantiation, the DataDirectory structure provides generalized parameter-passing 
without explicit parameter lists. It enhances integration by supplying a mecha- 
nism for transparent data sharing among independent modules. 

5 Building Components from an Input File - Scripting 

At the start of its execution, a Tecolote program must specify which modules 
to use, their initial data values, and the high-level control structure in which 
they are applied. Tecolote uses a different component for each option and em- 
ploys persistents to fill in the data needed by the options. Therefore, a Tecolote 
program built from components must use a methodology that creates objects 
from its components and places persistent data in the objects. We incorporate 
a scripting language into the Tecolote framework to accomplish these tasks. 

Each object is described by an object name, a MetaType name, and a list of 
persistent values. Object hierarchy (or nesting) is indicated by listing one object 
in the persistent list of another. The following example shows nested objects,
where a GammaLaw is created as the Eos persistent of a Material. An object’s
constructor is called before its persistents are loaded. Therefore, an optional
initialize function can be called after an object’s persistents have been
loaded to perform further initialization that depends on persistent values.

gas = Material(
    Eos = GammaLaw(
        gamma = 0.5,
        pmin = 0.001
    )
),

In both debugging and actual use, it is desirable to change the control flow of 
a program without rebuilding the entire code. Tecolote provides this flexibility by 
allowing the user to specify higher level function sequences in the input file. The 
necessary facilities are already provided by Tecolote and may be extended by an 
application programmer through a MetaType that maps methods and functions 
into function objects available from the input file. 

The language, based on Backus’ FP language and Robison’s IFP, includes
program-forming operations (PFOs), basic objects, and elementary operations.
The PFOs include control structures such as branching (If), looping (While),
and sequences (Compose). Examples of basic objects are Lists, Numbers, and
strings. Elementary operations that act on basic objects might include
arithmetic, comparison, and collection functions such as concatenation and
length. In Tecolote, as in FP, users may define new objects and functions, but
not new PFOs.

The input file describes the initial object hierarchy of a program. In addition, 
it selects which components will be used as well as the initial values of their per- 
sistents. The functional scripting language allows the programmer, by combining 
function objects using PFOs, to define an application’s high-level behavior. 
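
The idea of building behavior by combining function objects with a fixed set of program-forming operations can be sketched in C++ itself. The combinators below are hypothetical illustrations, not Tecolote's scripting language; in particular, the argument order of `compose` (apply `g` first, then `f`, as in FP's composition) is an assumption.

```cpp
#include <cassert>
#include <functional>

using Fn = std::function<long(long)>;    // elementary operation on a number
using Pred = std::function<bool(long)>;  // predicate on a number

// Compose(f, g): apply g first, then f.
Fn compose(Fn f, Fn g) {
    assert(f && g);
    return [=](long x) { return f(g(x)); };
}

// If(p, f, g): apply f where p holds, otherwise g.
Fn if_(Pred p, Fn f, Fn g) {
    assert(p && f && g);
    return [=](long x) { return p(x) ? f(x) : g(x); };
}

// While(p, f): apply f repeatedly as long as p holds.
Fn while_(Pred p, Fn f) {
    assert(p && f);
    return [=](long x) {
        while (p(x)) x = f(x);
        return x;
    };
}
```

Users of such a scheme define new functions and combine them, but the set of combinators itself stays fixed, which mirrors the restriction that new PFOs cannot be defined.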

6 Registering a Component with Tecolote - MetaTypes 

In most applications, a module must directly reference other modules with which 
it interacts. This requirement obstructs factorization and prevents the applica- 
tion programmer from deferring module interactions until run-time. In addition, 
it is tedious to find all references to a module when replacing a module
referenced in many locations within the program. In contrast, Tecolote
components are registered only once, in the MetaSet, a table containing all
the program components (see Fig. 2). Individual components interact only
indirectly, through the MetaSet, promoting component independence.

A module is registered as a component with Tecolote by using a MetaType
which, when invoked, automatically registers itself with the MetaSet. An
object’s MetaType is an object in its own right, much like Java’s “Class”
class. The MetaType for a class:

— has a name in the MetaSet, 

— holds the persistent list for that class,

— can create and initialize that class,

— carries its C++ type information to run-time, and

— can convert the MetaType from or to a single base class. 

Many languages create a unified type hierarchy by having a single base type
from which all other classes are derived (Java’s Object class, for example).
Tecolote classes, on the other hand, do not share a common base class.
Eliminating this restriction allows Tecolote to incorporate classes not
written for the framework (such as standard container classes and C++ basic
types). Non-Tecolote classes are incorporated into the Tecolote type system by
describing their basic features and their persistents in a MetaType.

In the example below, the GammaLaw<Cell> class is registered with the fra- 
mework. 

#include "GammaLaw.hh"

static MetaTecolote<GammaLaw<Cell>, Eos>
    GammaLawMeta("GammaLaw", MAKE_PERSISTENTS(GammaLaw<Cell>));






The generic MetaTecolote class is used to instantiate objects of classes with a
constructor that takes DataDirectory and string arguments. Other MetaTypes
are available that instantiate objects of classes with a constructor that
takes no arguments. MetaTecolote is best used for classes that need more
information about their environment, while the other MetaTypes are used for
classes written without knowledge of Tecolote. The class GammaLaw<Cell> is
given the name “GammaLaw” in the MetaSet and has the base class Eos. The
MAKE_PERSISTENTS macro registers the persistent information defined for the
GammaLaw class.



[Figure: MetaSet]

Fig. 2. Conceptual diagram of the Tecolote framework



All MetaTypes in the system register themselves with the MetaSet, a table 
that may be searched either by the MetaType name or by the type of the class 
that the MetaType contains. Although all modules are registered identically, 
physics modules and I/O modules interact with the MetaSet differently. Physics 
modules do not make explicit use of the MetaSet; however, I/O modules use the 
MetaSet explicitly both to build object hierarchies from input and to output 
program data. One example of this interaction, mentioned above, is the use of 
persistent information to perform output. Input modules also use a MetaType’s 
constructor, initializer, and base class in creating the object hierarchy. 

Two I/O modules shown in Fig. 2 are the Parser and the Printer. The
Parser reads data from the input file and then finds the MetaType by name in
the MetaSet. After finding the MetaType, the Parser creates the object, loads
its persistents, and calls its initializer. When an object is passed to the
Printer, the Printer uses its C++ type to find the corresponding MetaType in
the MetaSet. It then prints the MetaType name, scans the persistent list, and
prints each sub-object in turn.

MetaTypes and the MetaSet are key elements for supporting factorization in
Tecolote. MetaTypes turn C++ classes into components, and the MetaSet may
be searched for any component in an application program, which provides a level
of indirection between components.
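
Self-registration in a central table can be sketched with a static object whose constructor adds a factory to a registry. All names here (`Registry`, `Registrar`, the `pressure` method) are hypothetical and far simpler than MetaType and MetaSet; the default value of `gamma` is purely illustrative.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Eos {  // base class through which client code uses a model
    virtual ~Eos() {}
    virtual double pressure(double rho, double e) const = 0;
};

struct GammaLaw : Eos {
    double gamma = 1.4;  // illustrative default, not taken from the paper
    double pressure(double rho, double e) const override {
        return (gamma - 1.0) * rho * e;  // gamma-law (ideal gas) relation
    }
};

// Table from component name to factory; stands in for the MetaSet.
class Registry {
public:
    using Factory = std::function<std::unique_ptr<Eos>()>;
    static Registry& instance() {
        static Registry r;
        return r;
    }
    void add(const std::string& name, Factory f) {
        assert(f);
        table_[name] = std::move(f);
    }
    std::unique_ptr<Eos> create(const std::string& name) const {
        auto it = table_.find(name);
        if (it == table_.end()) return std::unique_ptr<Eos>();
        return it->second();
    }
private:
    std::map<std::string, Factory> table_;
};

// Registers T under `name` when constructed, like a static MetaType object.
template <class T>
struct Registrar {
    explicit Registrar(const std::string& name) {
        Registry::instance().add(name, [] { return std::unique_ptr<Eos>(new T()); });
    }
};

static Registrar<GammaLaw> gammaLawMeta("GammaLaw");
```

A parser-like client only needs the string "GammaLaw" and the base class `Eos`; it never refers to the `GammaLaw` type directly.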

7 Conclusion 

By using the various techniques in the Tecolote Framework, the Los Alamos 
National Laboratory’s Blanca Project has been able to integrate a wide variety 
of physics packages into codes with relative ease. We add new physics models by 
modifying only the code in the MetaSet, without rewriting any of the code’s I/O 
modules. Complex data sharing is accomplished by the DataDirectory, which 
allows us to avoid complex calling sequences or global variables. By deferring 
the choice of components until they are specified in the input file at run-time, 
we ensure maximum flexibility in the code. The combined benefits from this 
component architecture approach ensure simpler, faster, and less error-prone 
means for adding new physics modules to a program. 

Our future work will address the effects of Tecolote’s component architecture
and functional scripting language. Potential areas of investigation include
application programming methodologies in a functional scripting language, the
high-level expression of parallelism in the component architecture (given
functional-language guarantees), and the applicability of a functional
language in specifying object hierarchies and application control flow.


Abstract. We discuss the parallelization and object-oriented implementation
of Monte Carlo simulations for physical problems. We present a
C++ Monte Carlo class library for the automatic parallelization of Monte
Carlo simulations. Besides discussing the advantages of object-oriented
design in the development of this library, we show examples of how C++
template techniques have allowed very generic but still optimal algorithms
to be implemented for wide classes of problems. These parallel
and object-oriented codes have allowed us to perform the largest quantum
Monte Carlo simulations ever done in condensed matter physics.



1 Introduction 

The Monte Carlo method has been one of the most successful, if not the most
successful, numerical methods in the simulation of physical systems. Its
applications span all length scales, ranging from large astrophysics
simulations of galaxy clusters, to simulations of properties of solids and
liquids, down to simulations of quarks and gluons, the constituents of protons
and neutrons.

In solid state physics, the usual Monte Carlo algorithms were easy to vectorize
and ideally suited for vector supercomputers. However, in the most interesting
cases, close to phase transitions, these “local” Monte Carlo algorithms suffer
from so-called “critical slowing down,” which leads to an extra factor in
the CPU time that grows as a power of the system size L. Modern “cluster”
algorithms beat this slowing down, but one has to deal with much more complex
data structures and with algorithms that do not vectorize well.

In this paper we show how almost all kinds of Monte Carlo simulations,
including the cluster algorithms, can be parallelized very efficiently, and we
introduce a Monte Carlo class library and application framework that
automatically performs this parallelization. Additionally, we present our
experiences in using C++ template techniques to write generic Monte Carlo
programs for a wide class of model systems, and in using them for more than
600 years of CPU time on a wide variety of workstation clusters and massively
parallel machines.

2 Monte Carlo Simulations 

Monte Carlo simulations are the only useful way to evaluate high-dimensional
integrals. Such integrals are very common in the simulation of many-body
systems. For example, in a classical molecular dynamics simulation of M
particles, the phase space has dimension 6M (3 coordinates each for position
and velocity).

The usual numerical integration techniques are very slow for high-dimensional
integrals. For example, with the Simpson rule in $d$ dimensions with $N$
equidistant points, the error decreases as $N^{-4/d}$. For the corresponding
Monte Carlo summation with $N$ random points $x_i$ sampled with some
distribution $p(x_i)$, the integral is estimated by

$$\int f(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{p(x_i)}$$

and the statistical error decreases as $N^{-1/2}$. For $d > O(10)$ the Monte
Carlo integration method is thus faster.
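
A minimal sketch of such a Monte Carlo summation is given below, assuming uniform sampling on the unit hypercube, so that $p(x) = 1$. The names `Estimate`, `mc_integrate`, and `sum_of_squares` are illustrative only; the test integrand $\sum_k x_k^2$ has the exact integral $d/3$.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct Estimate {
    double value;  // Monte Carlo estimate of the integral
    double error;  // one-sigma statistical error, shrinking like N^{-1/2}
};

// Integrate f over [0,1]^d with n uniformly distributed sample points.
template <class F>
Estimate mc_integrate(F f, std::size_t d, std::size_t n, unsigned seed) {
    assert(n > 1);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::vector<double> x(d);
    double sum = 0.0, sum2 = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        for (auto& xi : x) xi = u(gen);
        const double fx = f(x);
        sum += fx;
        sum2 += fx * fx;
    }
    const double mean = sum / n;
    double var = (sum2 / n - mean * mean) * n / (n - 1.0);  // sample variance
    if (var < 0.0) var = 0.0;  // guard against rounding
    return {mean, std::sqrt(var / n)};
}

// Example integrand: sum of squares of the coordinates.
inline double sum_of_squares(const std::vector<double>& x) {
    double s = 0.0;
    for (double xi : x) s += xi * xi;
    return s;
}
```

The cost per sample grows only linearly with the dimension, and the error estimate does not depend on the dimension explicitly, which is the reason the method wins for large $d$.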

Usually the points $x_i$ are generated by a Markov process
$x_1 \to x_2 \to \cdots \to x_i \to \cdots$. Starting from a random
configuration $x_1$, the Markov process must be iterated for a certain number
$N_{eq}$ of equilibration steps before it produces $N$ random samples having
the correct probabilities. This will be important for the performance of
the parallel implementation. For more details about Monte Carlo methods,
especially “importance sampling” and other techniques for reducing the
statistical error, we refer to standard textbooks.

3 Parallelization and Performance of Monte Carlo 
Simulations 

Our typical Monte Carlo simulations are easy to parallelize at several levels of 
granularity: 

— Often we need many simulations for hundreds of different parameter sets 
(system sizes, temperatures, and so forth). Being independent, they can be 
parallelized trivially, with negligible overhead and almost perfect scaling, 
as little inter-processor communication is needed. For example, we found a 
speedup of 95.5 on 96 nodes of an Intel Paragon. For numbers of simulations 
larger than the number of available nodes, this level of parallelization is 
efficient. 

— For one Monte Carlo simulation, uncorrelated Markov chains $\{\vec{x}_i\}$
of statistical samples can be generated on different nodes by starting
independent Monte Carlo runs with different random seeds. This level of
parallelization, however, incurs a slight overhead, since each run needs to be
equilibrated individually. On $P$ nodes this leads to a theoretical maximal
speedup of $P(1 + N_{eq}/N)/(1 + P N_{eq}/N)$. Since typically
$N \gg N_{eq}$, this level of parallelization scales well to 20 times more
nodes than simulations.

— Only if the equilibration time $N_{eq}$ is very long, or if memory needs
require it, is it worth parallelizing a single Monte Carlo run. For example,
this is possible by distributing the particles in the simulation over
different nodes. This is, however, rarely done because of communication
overhead.
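
The speedup formula for independent Markov chains follows from comparing one run of $N_{eq} + N$ steps with $P$ runs of $N_{eq} + N/P$ steps each. The helper below simply evaluates it; the function name and the sample values in the note are illustrative and not taken from the paper.

```cpp
#include <cassert>

// Theoretical maximal speedup on P nodes when every chain must first perform
// n_eq equilibration steps and the P chains together produce n samples:
// (n_eq + n) / (n_eq + n / P) = P (1 + n_eq/n) / (1 + P n_eq/n).
double max_speedup(double P, double n_eq, double n) {
    assert(P >= 1.0 && n > 0.0 && n_eq >= 0.0);
    const double r = n_eq / n;
    return P * (1.0 + r) / (1.0 + P * r);
}
```

For instance, with an equilibration phase of 1% of the measurement steps, 20 nodes give a speedup of about 16.8 rather than 20, and the loss grows with the number of nodes.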

We found, however, that the main bottleneck in scaling to a large number
of nodes is the disk I/O needed at the beginning and end of each job
(a simulation typically takes several weeks and thus has to be split into many
separate jobs, requiring us to temporarily store the configurations on disk).
Due to limitations of the parallel file system of the Hitachi SR2201 we used,
this time grows faster than the data size as we increase the number of nodes
(250 sec for 256 nodes, 780 sec for 512 nodes, and 2970 sec for 1024 nodes for
a typical quantum Monte Carlo simulation). On the machine we used, the CPU
time for a large job is unfortunately limited to one hour per node, so a
typical program scales well only up to 256 nodes, where the disk I/O overhead
is of the order of 15%.



4 The Alea Application Framework and Class Library 

Monte Carlo simulations typically need a large amount of CPU time but, fortunately,
parallelize well. With the application framework and class library we have
developed, many scientists with no experience in parallel computing can make
use of the power of massively parallel computers for Monte Carlo simulations.
The Alea library (Latin for “dice”), written in C++, automatically parallelizes
many types of Monte Carlo simulation at the two generic levels mentioned above.

4.1 Classes for Monte Carlo Simulations 

From a user’s point of view, the library consists of three main classes, from which 
those specific to the Monte Carlo simulation are derived. 

— A simulation class handles the parallelization of the different runs and the
merging of the results of these runs. The user only has to override a work()
member function, specifying the amount of work which needs to be done on
this simulation. This value is then used for load balancing and serves as a
termination criterion once it is zero.

— A run class implements the actual Monte Carlo simulation. The following
functions have to be implemented for this class:

— a constructor to start a new run,

— functions to access data in a dump, as discussed in Sec. 4.2 below,

— a criterion is_thermalized() that tells whether the run is in equilibrium,

— a function do_step() that performs one Monte Carlo step and measurement.

— A container class, measurements, collects all the Monte Carlo statistics. 

This is all the information the library needs to know about a specific Monte
Carlo simulation. The library takes care of parameter input and startup,
hardware independent checkpointing (see Sec. 4.2), parallelization (see Sec. 4.3) and
dynamic load balancing, and evaluation and output of results.
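
To make the division of labor concrete, here is a minimal sketch of a user-defined run. Only the member names is_thermalized() and do_step() and the class names run and measurements come from the text; the base-class layout, the toy `pi_run` class, and the `drive` function are hypothetical stand-ins for what the library would provide.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Hypothetical stand-in for the library's container of Monte Carlo statistics.
class measurements {
public:
  void add(double x) { sum_ += x; ++count_; }
  std::size_t count() const { return count_; }
  double mean() const { assert(count_ > 0); return sum_ / count_; }
private:
  double sum_ = 0.0;
  std::size_t count_ = 0;
};

// Hypothetical base class; the member names follow the text.
class run {
public:
  virtual ~run() {}
  virtual bool is_thermalized() const = 0;
  virtual void do_step() = 0;
};

// Toy user run: estimates pi by sampling points in the unit square.
// The first 100 steps are discarded to mimic an equilibration phase.
class pi_run : public run {
public:
  explicit pi_run(unsigned seed) : rng_(seed), dist_(0.0, 1.0), steps_(0) {}
  bool is_thermalized() const override { return steps_ >= 100; }
  void do_step() override {
    const double x = dist_(rng_);
    const double y = dist_(rng_);
    if (is_thermalized()) results_.add(x * x + y * y < 1.0 ? 4.0 : 0.0);
    ++steps_;
  }
  const measurements& results() const { return results_; }
private:
  std::mt19937 rng_;
  std::uniform_real_distribution<double> dist_;
  std::size_t steps_;
  measurements results_;
};

// Hypothetical driver: steps the run until `work` measurements are collected.
double drive(pi_run& r, std::size_t work) {
  while (r.results().count() < work) r.do_step();
  return r.results().mean();
}
```

In the real library the driver role is played by the simulation class, which would also merge the measurements of several runs.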

4.2 Object Serialization 

An object serialization scheme was introduced to enable reading/writing of ob- 
jects from/to data files, and transmission of objects to remote nodes. 

C++ does not contain built-in object serialization, unlike Java. The C++
“iostream” library is also not suitable, since it is designed for text output and
does not ensure that an object can be recreated from the textual output.
[Figure 1: diagram with columns labeled master node, slave node #1, slave node #2]




Fig. 1. Illustration of the parallelization and remote object creation. Each box repre- 
sents an object. The labels of each object, from top to bottom, represent the class 
hierarchy from base class to most derived class. Solid lines are the creation of local 
objects, which in case of proxies send a message to a slave node to request the creation 
of the actual object. Illustrated is the creation of two simulations, which subsequently 
create three runs. 



However, our implementation of object serialization is modelled after the 
“iostream” library. Objects can be written to odump streams using operator << 
and read from idump streams using operator >>. Extensions to new classes are 
done just as in the iostream library, by overloading these operators. In particular, 
we have implemented two important types of such “dumps”: 

— xdr_odump and xdr_idump use the XDR format to write the data in a hard- 
ware independent binary format. These are used for hardware independent 
checkpoint files and for storing results of simulations. 

— mp_odump and mp_idump use an underlying message passing library to
send objects from one node to another.

These latter classes allow easy parallelization using distributed objects. 
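
The operator-based interface can be illustrated with a small in-memory dump. This is only a sketch under stated assumptions: the classes `odump` and `idump` below are simplified stand-ins written for this example, they copy raw bytes (so, unlike the XDR dumps, they are not hardware independent), and `spin_state` is a hypothetical user class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Simplified output dump that appends raw bytes to a buffer.
class odump {
public:
  odump& operator<<(double x) { write(&x, sizeof x); return *this; }
  odump& operator<<(int x) { write(&x, sizeof x); return *this; }
  const std::vector<char>& buffer() const { return buf_; }
private:
  void write(const void* p, std::size_t n) {
    const char* c = static_cast<const char*>(p);
    buf_.insert(buf_.end(), c, c + n);
  }
  std::vector<char> buf_;
};

// Simplified input dump that reads the values back in the same order.
class idump {
public:
  explicit idump(const std::vector<char>& b) : buf_(b), pos_(0) {}
  idump& operator>>(double& x) { read(&x, sizeof x); return *this; }
  idump& operator>>(int& x) { read(&x, sizeof x); return *this; }
private:
  void read(void* p, std::size_t n) {
    assert(pos_ + n <= buf_.size());
    std::memcpy(p, buf_.data() + pos_, n);
    pos_ += n;
  }
  std::vector<char> buf_;
  std::size_t pos_;
};

// A user class becomes serializable by overloading the two operators.
struct spin_state {
  int sites;
  double energy;
};
odump& operator<<(odump& d, const spin_state& s) { return d << s.sites << s.energy; }
idump& operator>>(idump& d, spin_state& s) { return d >> s.sites >> s.energy; }
```

A checkpoint or a message would then be produced by `out << state` and restored by `in >> state`, whatever the concrete dump type behind the stream is.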

4.3 Parallelization Using Distributed Objects 

Simulations are parallelized as discussed in Sec. 3. The master node determines
how much work needs to be done by each simulation and distributes the simu- 
lations across the available nodes accordingly. It then creates the simulation 
objects remotely, which in turn create one or several run objects. 

Remote object creation is done by creating a proxy object (called either 
remote_simulation or remote_run), which in turn sends a message to the re- 
mote node requesting the creation of the object. Remote method invocation 
similarly invokes the method of the proxy object, which then sends a message 
to the remote node requesting the invocation of the method, and perhaps waits
for a return value. Figure 1 shows this class hierarchy and method invocation.
The scheme is simplified greatly compared to general distributed object systems 
since on each node, there exists at most one simulation and one run object. 

Slave nodes check for messages and perform the requested method invocati- 
ons. If no message needs to be processed, they call the do_step method of the 
local Monte Carlo run object to perform Monte Carlo steps. 
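
The interplay of proxy and slave loop can be sketched without a real message passing library by using a queue as the mailbox. Only the name remote_run and the do_step method are taken from the text; the message tags, the `slave_node` class, and the `counting_run` stand-in are assumptions made for this illustration.

```cpp
#include <cassert>
#include <queue>

enum class message { create_run, get_steps, halt };  // hypothetical message tags

struct counting_run {  // stand-in for a Monte Carlo run object
  long steps = 0;
  void do_step() { ++steps; }
};

class slave_node {
public:
  // One iteration of the slave's main loop: serve a pending message if there
  // is one, otherwise do Monte Carlo work. Returns false after a halt message.
  bool poll(std::queue<message>& inbox, std::queue<long>& replies) {
    if (!inbox.empty()) {
      const message m = inbox.front();
      inbox.pop();
      switch (m) {
        case message::create_run:  // remote object creation
          run_ = counting_run();
          has_run_ = true;
          break;
        case message::get_steps:   // remote method invocation with return value
          replies.push(has_run_ ? run_.steps : 0);
          break;
        case message::halt:
          return false;
      }
    } else if (has_run_) {
      run_.do_step();
    }
    return true;
  }
private:
  bool has_run_ = false;
  counting_run run_;
};

// Proxy living on the master: constructing it requests the remote object,
// calling a method on it sends the corresponding message.
class remote_run {
public:
  explicit remote_run(std::queue<message>& to_slave) : out_(to_slave) {
    out_.push(message::create_run);
  }
  void request_steps() { out_.push(message::get_steps); }
private:
  std::queue<message>& out_;
};
```

Because a node holds at most one run object, the messages need no object identifier, which is the simplification the text refers to.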



4.4 Support Classes 

In addition to the Monte Carlo simulation classes, the library also provides a 
variety of useful classes for parameter input, Monte Carlo measurements and 
their error analysis, the analysis of time series and the generation of plots. 



4.5 Failure Tolerance 

The library was designed to be tolerant to failure of single workstations on a 
workstation cluster when using PVM as the underlying message passing library. 
This is important as it allows us to perform the calculations on workstations in 
environments where other users reboot machines. Failure recovery is implemented
by periodic checkpointing and by automatically restarting failed simulations
from the latest checkpoint. Since the C++ exception mechanism is not yet fully
supported by compilers, the implementation of this feature of the library has
been delayed until a future release.



5 Object-Oriented Techniques for Monte Carlo 
Simulations 

The first version of the above Monte Carlo library was developed in 1994 and
1995. At that time, the performance of C++ for scientific simulations was not
good enough to allow the use of object-oriented techniques for the CPU-intensive
parts of the actual Monte Carlo simulations. They were coded in C-style C++
or even in FORTRAN.

Meanwhile, the template mechanism has been extended and is supported by
more compilers. The use of “light objects” and expression templates
allows a higher level of abstraction and the use of object-oriented design without
any abstraction penalty in the performance.

In the past year we have, with good success, made extensive use of such tem- 
plate techniques to develop generic, but still optimal, algorithms for a variety 
of condensed matter problems, and have used these programs successfully for a 
large number of simulations. This is, to our knowledge, one of the first applica- 
tions of such techniques to large high-performance numerical calculations. 

5.1 Generic Simulations Using Templates

Simulations of condensed matter problems have to be done for a variety of crystal
lattice structures. Thus, it is advantageous to write a simulation program for
general lattice structures. Usually, this is done by storing the lattice as a
two-dimensional array of all neighbors of all sites. Using modern C++ compilers, we
can describe the lattice by a class, shown here for a chain of sites:

class chain_lattice {
public:
  typedef unsigned int site_number;
  chain_lattice(site_number l) : length(l) {}
  // number of sites
  site_number volume() const { return length; }
  // every site of a periodic chain has two neighbors
  site_number neighbors(site_number site) const { return 2; }
  // nb == 0: right neighbor, nb != 0: left neighbor (periodic boundaries)
  inline site_number neighbor(site_number site, site_number nb) const
    { return (nb ? (site == 0 ? length - 1 : site - 1)
                 : (site == length - 1 ? 0 : site + 1)); }
private:
  site_number length;
};

This class can then be used as a template parameter of the run class: 

template <class LATTICE, class MODEL>
class user_run : public run {
private:
  LATTICE lattice;
  MODEL model;
public:
  virtual void do_step(); ...
};

In the CPU-intensive part (the function do_step()), most of the time is spent
evaluating an interaction energy or cost function like:

for (typename LATTICE::site_number i = 0; i < lattice.volume(); ++i)
  for (typename LATTICE::site_number n = 0;
       n < lattice.neighbors(i); ++n)
  {
    ... model.interaction(state[i], state[lattice.neighbor(i, n)]) ...
  }

Implementing the lattice information through template parameters as inlined 
functions as above allows the compiler to optimize more aggressively. In this 
example the innermost loop can be unrolled, and no memory access is needed to 
determine the neighbor, in contrast to the typical FORTRAN implementation 
which stores the numbers of the neighbors in two-dimensional arrays. 
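
The fragments above can be combined into a small self-contained program. The following is a minimal sketch: `chain_lattice` repeats the class from the text, while `ising_model`, the integer spin state, and the free function `total_energy` are hypothetical additions chosen to keep the example short.

```cpp
#include <cassert>
#include <vector>

class chain_lattice {
public:
  typedef unsigned int site_number;
  chain_lattice(site_number l) : length(l) {}
  site_number volume() const { return length; }
  site_number neighbors(site_number) const { return 2; }
  site_number neighbor(site_number site, site_number nb) const {
    return nb ? (site == 0 ? length - 1 : site - 1)
              : (site == length - 1 ? 0 : site + 1);
  }
private:
  site_number length;
};

// Hypothetical model: Ising-like bond energy -s_i * s_j for spins +1 / -1.
struct ising_model {
  int interaction(int a, int b) const { return -a * b; }
};

// Generic energy sum; lattice and model are resolved at compile time,
// so the neighbor lookup and the interaction can both be inlined.
template <class LATTICE, class MODEL>
int total_energy(const LATTICE& lattice, const MODEL& model,
                 const std::vector<int>& state) {
  int e = 0;
  for (typename LATTICE::site_number i = 0; i < lattice.volume(); ++i)
    for (typename LATTICE::site_number n = 0; n < lattice.neighbors(i); ++n)
      e += model.interaction(state[i], state[lattice.neighbor(i, n)]);
  return e / 2;  // each bond is visited from both of its ends
}
```

For a periodic chain of four aligned spins, `total_energy` returns -4, one unit per bond; swapping in a different lattice class requires no change to the loop.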

Fig. 2. Example of a world line configuration in a quantum mechanical simulation

Similarly, Monte Carlo algorithms for a wide class of models and systems
often differ just by a function describing the interaction, or by a type representing
the states. These functions are typically very simple, containing just a few
operations. A generic program using virtual function calls to evaluate
an interaction energy is prohibitive because of the immense performance penalty
associated with each virtual function call. On the other hand, providing the model
as a template parameter again allows generic, but highly optimized, implementations.
Details and examples will be presented in forthcoming publications.
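
The two ways of making the interaction exchangeable can be put side by side. This is an illustrative sketch only: all class and function names are hypothetical, and whether the virtual variant is actually slower depends on the compiler, which may devirtualize simple cases.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Variant 1: run-time polymorphism; every bond costs a virtual call.
struct model_base {
  virtual ~model_base() {}
  virtual double interaction(int a, int b) const = 0;
};
struct virtual_ising : model_base {
  double interaction(int a, int b) const override { return -1.0 * a * b; }
};
double chain_energy_virtual(const model_base& model, const std::vector<int>& state) {
  double e = 0.0;
  for (std::size_t i = 0; i + 1 < state.size(); ++i)
    e += model.interaction(state[i], state[i + 1]);
  return e;
}

// Variant 2: the model is a template parameter; the call can be inlined.
struct inline_ising {
  double interaction(int a, int b) const { return -1.0 * a * b; }
};
template <class MODEL>
double chain_energy(const MODEL& model, const std::vector<int>& state) {
  double e = 0.0;
  for (std::size_t i = 0; i + 1 < state.size(); ++i)
    e += model.interaction(state[i], state[i + 1]);
  return e;
}
```

Both variants compute the same open-chain energy; the difference lies only in when the interaction function is bound.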



5.2 Light Objects 

Another use of templates and “light objects”, perhaps even more important,
is for data structures representing a physical state. In the quantum Monte
Carlo world line algorithm, a state is represented by “world lines” of particles,
as shown in Fig. 2. These world line configurations are described by the positions
(horizontal axis) and times (vertical axis) of kinks in the world lines. In the
Monte Carlo procedure, these kinks are shifted around. For that, it is necessary
to know, for each kink, the neighbors in the time direction (thin arrows in the
figure) and the previous kink on each neighboring site (thick arrows).

For simulations at low temperatures, where there are many such kinks along
vertical lines, it is advantageous to store and update links to all these neighbors
in a $(z + 2)$-fold linked list, where $z$ is the number of neighbors.

At high temperatures, however, it is faster to store only the links along one 
spatial site (thin lines) and to find the spatial neighbors (thick lines) by searching 
along the linked lists at those sites. 

Providing the actual representation of these kinks as a template parameter
allows us to have optimized codes for both high and low temperatures available
in the same program, which in other languages would only be possible using
a preprocessor. Thus, being able to optimize and fine-tune the data structures
easily has allowed us to get C++ codes that run faster than FORTRAN programs
for the same problem (33500 moves per second compared to 29000). Note
that this is not due to inherent speed of C++ versus C or FORTRAN. It is caused
by the fact that in practical complex applications, as compared to benchmarks,
well coded C++ allows easier optimization of data structures.

6 Experiences with Compilers and Machines

We summarize our experiences in porting our codes to several parallel machines 
and workstation clusters. The GNU g++ compiler (version 2.7.2) sometimes 
caused problems when compiling our template codes with optimization, but no 
problems were encountered with egcs after release 1.0.2. We had no problems 
with the KAI C++ compiler. The KAI compiler produced code which performed 
about 20% better. 

On parallel machines we had no problems on the Cray T3E, due to availa- 
bility of the KAI compiler. On the Intel Paragon we used the GNU compiler 
successfully. The Hitachi SR2201, the fastest machine in the world at its intro- 
duction, gave us problems because there was limited template support in its 
C++ compiler. 

In Monte Carlo programs, message passing speeds are irrelevant since almost 
no communication is necessary. The performance bottleneck is the I/O to the 
parallel file systems, at the beginning of a job to load the last configuration, 
and at the end to store the new configurations. Allowing large jobs to run for a 
longer time would enable us to extend the scaling beyond 256 nodes. 

7 Summary and Applications 

The library and programs discussed have been used now with success for three 
years, and have enabled us to perform the largest Monte Carlo simulations ever 
done for quantum mechanical simulations in condensed matter physics, some- 
times three orders of magnitude larger than previous simulations. This in turn 
has allowed us to answer long-standing interesting questions in this field.

The library will be publicly available when we finish rewriting it to use the 
new standard C++ library. Interested persons can contact the authors to obtain
the old version. In the future, we plan to extend template techniques to develop 
generic programs and classes also for other algorithms, methods, and quantum 
operators in the field of quantum simulations in physics. 

Abstract. We present the first parallel, object-oriented C++ implementation
of the dynamic recursion method. The recursion method is
a means to tridiagonalize sparse matrices efficiently and is useful for a 
wide number of problems in physics. Dynamic recursion describes an 
optimization of the standard recursion method by operating only with 
a dynamically varying subset of basis vectors — reducing memory needs 
and allowing the computation of very large systems. We show how a 
graph-based data structure permits storing and multiplying sparse ma- 
trices and vectors efficiently. We use a tree structure to cope with the 
dynamically changing basis set, manifested by perpetual creation and 
elimination of vector components and matrix elements. A “workpile” 
approach is employed to allow thread-based parallel execution. Systems 
with up to 10^ matrix elements have been simulated with the current 
implementation and the Anderson metal-insulator transition has been 
studied as a test-bed project. 



1 Introduction 

With the rising popularity of object-oriented languages for scientific applications,
a number of excellent programming “environments,” such as the Blitz++ [1] and
POOMA [2] packages, have been developed which, as a result of their dedication
to the object-oriented paradigm, offer data structures and operations resembling 
physical or mathematical concepts. This allows the straightforward implementa- 
tion of scientific models while eliminating the need of machine-oriented thinking. 
POOMA, in addition, features completely encapsulated parallelism. 

Following the same philosophy, we have developed a C++ kernel for applications
that are based on the dynamic recursion method. Our parallel
“recursion engine” is applicable to a variety of problems that can be formulated 
as sparse matrices such as those common in condensed matter physics. The ap- 
plication programmer only needs to specify a short module of code describing 
the physical system under study. The parallel recursion algorithm is captured in 
the data structures provided and completely transparent to the user. 

2 Recursion Method 

A large number of problems in physics can be formulated in terms of sparse 
matrices which have only a small number of non-zero elements. In fact, this 
is possible for any physical system with only short-ranged interactions in real
space. Once the physical problem is formulated with sparse matrices, it is usually 
solved by a matrix transformation, which makes computational approaches at- 
tractive. While a diagonalization yielding eigenvalues and eigenvectors is useful 
for small systems, it is usually not the best choice for large systems, because 
diagonalizations are not only very resource-intensive, but the eigenvalues are 
also very sensitive to boundary conditions. Physically interesting quantities are 
distributions of eigenvalues, such as a projected density of states, which can be 
obtained from a tridiagonalization of the sparse matrix.

The recursion method [3] is an efficient technique for such a tridiagonalization.
Suppose $H$ is a large symmetric (Hermitian) sparse matrix. For a physical
system, $H$ represents the system’s Hamiltonian in some suitably chosen basis
set. Then, tridiagonalization amounts to the matrix transformation

$$U^\dagger H U = J, \quad \text{with} \quad
J = \begin{pmatrix}
a_0 & b_1 &        &        & 0   \\
b_1 & a_1 & b_2    &        &     \\
    & b_2 & a_2    & \ddots &     \\
    &     & \ddots & \ddots & b_n \\
0   &     &        & b_n    & a_n
\end{pmatrix}, \tag{1}$$

where $U$ is the unitary transformation matrix to be found. (We follow the
convention of using the symbol $\dagger$ to denote the Hermitian conjugate. If all
matrices are real, this is equivalent to the transpose, indicated by $T$.) The
recursion method successively generates the column vectors $\{u_n\}$ of the matrix
$U$ by a recurrence relation. Starting from some vector $u_0$, $u_1$ is computed by

$$H u_0 = a_0 u_0 + b_1 u_1 , \tag{2}$$

and for $n \ge 1$

$$H u_n = a_n u_n + b_{n+1} u_{n+1} + b_n u_{n-1} . \tag{3}$$

The coefficients $a_n$ and $b_n$ are determined by the requirement that the vectors
$\{u_n\}$ be normalized and orthogonal to each other:

$$a_n = u_n^\dagger H u_n , \tag{4}$$

$$b_{n+1}^2 = [(H - a_n I) u_n - b_n u_{n-1}]^\dagger \, [(H - a_n I) u_n - b_n u_{n-1}] , \tag{5}$$

where $I$ is the identity matrix (for $n = 0$ the term $b_n u_{n-1}$ is absent, cf.
Eq. (2)). This prescription specifies all vectors $u_1 \ldots u_{n+1}$, but not the
start vector $u_0$. The choice of $u_0$ determines the state of projection
for the projected density of states that can be computed from the $a_n$ and $b_n$.
For the mathematical details of the relation between the tridiagonal matrix $J$
and the projected density of states, see the literature on the recursion method [3].

The recursion method is computationally interesting for physical problems
because the projected density of states converges exponentially fast with the
number of recursions. In other words, the approximation of the (potentially
infinite) system $H$ that is represented by $J$ improves exponentially with the
number of transformation vectors included. Even more generally, this convergence
property makes the method rather insensitive to any errors, such as those
arising from the finite precision of computer arithmetic, the loss of orthogonality
of the basis vectors, and others. Te has shown that the total error will always
remain bounded and not accumulate exponentially. The concept of dynamic
recursion has grown from this idea: one can eliminate the smallest components
in the vectors $\{u_n\}$, thereby reducing memory use, which in turn allows an
increase in the maximally computable system size within the given resources,
while the total error remains below a predetermined threshold. The effect is to
“optimize” the basis set used in the transformation for the desired system size
and available resources.
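
The recurrence of Eqs. (2) to (5) can be written down in a few lines for a small real symmetric matrix. The sketch below uses a dense matrix purely for brevity; it is not the authors' sparse implementation, and the type and function names are assumptions of this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense, real symmetric stand-in for the sparse H

struct Coefficients {
  Vec a;  // a[n] as in Eq. (4)
  Vec b;  // b[n] as in Eq. (5); b[0] is unused and kept at 0
};

// Runs up to `steps` recursion steps starting from the normalized vector u.
Coefficients recursion(const Mat& H, Vec u, std::size_t steps) {
  const std::size_t dim = u.size();
  Vec prev(dim, 0.0);  // u_{n-1}; zero for n = 0
  Coefficients c;
  c.b.push_back(0.0);
  for (std::size_t n = 0; n < steps; ++n) {
    // w = H u_n
    Vec w(dim, 0.0);
    for (std::size_t i = 0; i < dim; ++i)
      for (std::size_t j = 0; j < dim; ++j) w[i] += H[i][j] * u[j];
    // Eq. (4): a_n = u_n^T H u_n
    double a = 0.0;
    for (std::size_t i = 0; i < dim; ++i) a += u[i] * w[i];
    c.a.push_back(a);
    // Eq. (5): b_{n+1}^2 = |(H - a_n I) u_n - b_n u_{n-1}|^2
    double b2 = 0.0;
    for (std::size_t i = 0; i < dim; ++i) {
      w[i] -= a * u[i] + c.b[n] * prev[i];
      b2 += w[i] * w[i];
    }
    const double b = std::sqrt(b2);
    c.b.push_back(b);
    if (b < 1e-14) break;  // invariant subspace reached; no further vector exists
    // Eqs. (2) and (3): u_{n+1} = w / b_{n+1}
    for (std::size_t i = 0; i < dim; ++i) {
      prev[i] = u[i];
      u[i] = w[i] / b;
    }
  }
  return c;
}
```

For a three-site chain with zero diagonal and unit off-diagonal elements, started on the first site, this yields $a_0 = a_1 = 0$ and $b_1 = b_2 = 1$: the tridiagonal matrix reproduces the chain itself.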

In this group, dynamic recursion has been applied to the investigation of a 
metal-insulator transition in an alkali adlayer on a surface and the study of
the Anderson transition in a three-dimensional system. The latter has been a
test-bed project for our new parallel implementation of dynamic recursion. We 
have simulated systems with 10^ sites in initial runs — a number likely to increase 
once inefficiencies in the code have been rooted out. 

3 Data Structure and Algorithm for Dynamic Recursion

An implementation of this dynamic recursion algorithm requires the following:
(a) an efficient way to store the sparse matrix $H$ and the associated vectors $u_n$,
allowing for dynamic generation and deletion of components; (b) an algorithm
for the matrix-vector multiplication $H u_n$ of Eq. (3), interfacing with this storage
scheme; (c) a partitioning of the problem suitable for concurrent execution.

Let us consider a simple physical example: a 2-d square lattice, e.g. of atoms,
with the sites (“nodes”) representing atomic states and the connecting lines (“edges”)
the non-zero overlap integrals between nearest neighbors (Fig. 2, bottom),
will result in a sparse matrix. Mathematically speaking, the nodes represent the
set of basis vectors in which the matrix is given, and the edges the matrix elements
between them. If one chooses some scheme for enumerating the sites, e.g.
spiraling outward from some center site, the system can be described as a matrix
$H$ with the row and column indices referring to site indices. The matrix will be
sparse because per site (per row in the matrix) there are only very few neighbors
(non-zero column entries in that row). For a physical system, the elements of $H$
can usually be derived from some formula; for the purpose of this paper, we will
take the diagonal elements as zero and the off-diagonal elements as constant.
The goal of the recursion now is to generate new basis vectors $u_n$. If the original
basis vectors, for example, are given by the set

$$(1, 0, 0, 0, \ldots)^T,\ (0, 1, 0, 0, \ldots)^T,\ (0, 0, 1, 0, \ldots)^T,\ (0, 0, 0, 1, \ldots)^T,\ \ldots \tag{6}$$

then a new basis vector will have a number of components on the original basis
vectors. If the zeroth vector of the new basis, $u_0$, is chosen as

$$u_0 = (1, 0, 0, 0, 0, 0, \ldots)^T , \tag{7}$$

then, from Eq. (2),

$$u_1 = \tfrac{1}{2}\,(0, 1, 1, 1, 1, 0, \ldots)^T . \tag{8}$$

Fig. 1. Components of the $u_n$ vectors for a square lattice, with $n$ = 0, 1, 2, 5, 10,
respectively. Shading indicates magnitudes, with black equaling 1.0



Figure 1 illustrates this spatial propagation of the vector components for
a 2-d square lattice. All sites that carry a non-zero component (“weight”) are
called “active”. The ensemble of all active sites represents a given vector $u_n$.
Multiplication with $H$, i.e. $H u_n$, will yield a vector in which all the neighbors
to the active sites become active as well, and only the matrix elements between
active sites are referenced. These matrix elements can be generated when needed
and do not have to be stored at all times. In dynamic recursion, small elements
of $u_n$ can be deleted and so can their associated matrix elements.

A graph data structure mirroring the lattice, with nodes representing components
in the vector $u_n$ and edges the associated matrix elements, provides a
natural way of storing the matrix and vector for the purposes of the $H u_n$ operation,
which itself then amounts to a mere graph traversal (see pseudocode in
Fig. 4). Along with the components of $u_n$, the components of $u_{n-1}$ are stored as
well; they are needed for Eq. (3). We can see from Eqs. (4) and (5) that per recursion
step, two graph traversals are needed, because of the data dependency of Eq. (5)
on $a_n$. In the pseudocode, these are referred to as first and second phase of the
computation, respectively, with $a_n$ being calculated in the former.
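
The two passes and the deletion of small components can be sketched for the square-lattice example with zero diagonal and constant off-diagonal elements. Note the swapped-in data structure: an ordered map keyed by lattice coordinates replaces the graph and tree of the paper, so neighbor lookup here is a map search. All names and the `cutoff` parameter are assumptions of this example.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <utility>

using Site = std::pair<int, int>;          // (x, y) on the 2-d square lattice
using SparseVec = std::map<Site, double>;  // only "active" sites are stored

const double hopping = 1.0;  // constant off-diagonal element; the diagonal is zero

struct StepResult {
  double a;       // a_n, Eq. (4)
  double b_next;  // b_{n+1}, Eq. (5)
};

// One recursion step: on return u holds u_{n+1} and prev holds u_n.
// Components of u_{n+1} with magnitude not above `cutoff` are dropped.
StepResult recursion_step(SparseVec& u, SparseVec& prev, double b, double cutoff) {
  static const int dx[4] = {1, -1, 0, 0};
  static const int dy[4] = {0, 0, 1, -1};
  // First pass: w = H u_n (matrix elements generated on the fly), then a_n.
  SparseVec w;
  for (const auto& kv : u)
    for (int k = 0; k < 4; ++k)
      w[Site(kv.first.first + dx[k], kv.first.second + dy[k])] += hopping * kv.second;
  double a = 0.0;
  for (const auto& kv : u) {
    const auto it = w.find(kv.first);
    if (it != w.end()) a += kv.second * it->second;
  }
  // Second pass: w <- w - a_n u_n - b_n u_{n-1}, and b_{n+1} = |w|.
  for (const auto& kv : u) w[kv.first] -= a * kv.second;
  for (const auto& kv : prev) w[kv.first] -= b * kv.second;
  double b2 = 0.0;
  for (const auto& kv : w) b2 += kv.second * kv.second;
  const double b_next = std::sqrt(b2);
  assert(b_next > 0.0);
  // Normalize, delete small components, and shift the vectors.
  SparseVec next;
  for (const auto& kv : w) {
    const double x = kv.second / b_next;
    if (std::fabs(x) > cutoff) next[kv.first] = x;
  }
  prev = std::move(u);
  u = std::move(next);
  return {a, b_next};
}
```

Starting from a single active site with weight 1, the first step gives $a_0 = 0$, $b_1 = 2$, and four active neighbors of weight 1/2, in agreement with Eq. (8). With a non-zero cutoff the vector is no longer exactly normalized, which is the controlled error that dynamic recursion accepts.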

The active sites make up a lattice that grows as a result of the matrix-vector
multiplication. Wherever there are sites with fewer neighbors than the
coordination number (four in the 2-d square lattice), new nodes are added. Some
mechanism is then needed to determine if some of the existing nodes of the graph
may qualify as neighbors to a new one (Fig. 2, bottom). The neighbor-finding
problem can be solved by a tree structure associated with a binary subdivision
of coordinate space (Fig. 2, top), as it has been used, for example, in the
N-body problem [2]. At the same time, the tree contains the functionality of the
graph data structure, with the graph traversal then amounting to a tree traversal
which can be done recursively in an efficient manner. In addition, a tree structure
allows for straightforward parallel execution.

Fig. 2. Graph representation of matrix (top left and bottom); subdivision of space and
correspondence to a tree structure (top left and top right); growth of a new node from
an existing one and identification of neighbors (bottom)

From an analytical viewpoint, the tree supplies an enumeration of the nodes.
There is no unique way of doing this and the tree performs this enumeration in
such a way that the locality of the problem is preserved, i.e. nodes that are close
together in space reside on adjacent branches of the tree. This makes neighbor-finding
a local and thus fast process. At the same time, there is no performance
penalty if the nodes are added and removed dynamically, a major advantage
over other algorithms based on look-up lists or nearest neighbor maps.
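
The idea of going up to the common ancestor and then down along "conjugate" directions can be illustrated on a path representation of a 2-d tree. This is a sketch of the principle only, restricted to the eastern neighbor; the types `quadrant` and `tree_path` and both functions are hypothetical and do not reflect the paper's pointer-based tree.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One step of the root-to-leaf path: which half of the parent cell we are in.
struct quadrant {
  bool east;   // false = west half, true = east half
  bool north;  // false = south half, true = north half
};
using tree_path = std::vector<quadrant>;  // path[0] is the child of the root

// Rewrites `path` to address the leaf directly east of it.
// Returns false (path unchanged) if that neighbor lies outside the root cell.
bool step_east(tree_path& path) {
  // Go up while we sit in an east half: the neighbor is then in another branch.
  std::size_t level = path.size();
  while (level > 0 && path[level - 1].east) --level;
  if (level == 0) return false;
  // The common ancestor has been found: cross from its west to its east child.
  path[level - 1].east = true;
  // Go back down, taking "west" wherever "east" was taken on the way up.
  for (std::size_t l = level; l < path.size(); ++l) path[l].east = false;
  return true;
}

// Converts a path to integer coordinates (used for checking only).
void coordinates(const tree_path& path, unsigned& x, unsigned& y) {
  x = 0;
  y = 0;
  for (const quadrant& q : path) {
    x = 2 * x + (q.east ? 1u : 0u);
    y = 2 * y + (q.north ? 1u : 0u);
  }
}
```

For sites that are close in space the common ancestor is usually only a few levels up, which is why the lookup is local; in this representation the procedure is equivalent to a binary increment of the x coordinate.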

4 C++ Class Design

The abstract concepts of graph, tree and nodes introduced in Sect. 3 lend themselves
perfectly to an object-oriented implementation. C++ has been chosen as
language because of its wide availability and high performance thanks to the
latest generation of optimizing compilers such as KAI’s C++ compiler.

Figure 3 shows the C++ design. The tree class hierarchy, headed by the abstract
base class tree, is responsible for traversing the tree, finding neighbors
and adding and deleting branches if this is called for as a result of nodes being
created or killed. Intermediate levels in the tree are represented by branch objects,
pointing to other tree objects below. The terminals of the tree are formed
by a derived class that defines the specific lattice structure of the physical system;
in our example, monoatomic_square_lattice implements a simple square
or cubic lattice. This class can easily be substituted to simulate other materials
while using the same API and machinery of the tree algorithms. The use of
templates here is also conceivable. Typically a cluster of nodes, such as a unit
cell of a crystal, is represented on the lowest level (cf. Fig. 2).

Fig. 3. C++ class hierarchy

The aforementioned two phases of the computation are modeled by the classes 
first_pass and second_pass. The tree traversal routine is supplied with an 
instance of the appropriate computation object which has a number of virtual 
member functions, as defined by the abstract base class computation, that are 
invoked on each node encountered. Efforts are currently under way to rewrite 
the code using templates to avoid the performance overhead of virtual functions. 

The remaining classes in Fig. 3 perform certain auxiliary functions having
to do with multi-threading and memory management. As dynamically allocated 
objects are of fixed size, heap management has been implemented very efficiently. 

5 Parallelization Issues 

For parallel execution, the computational problem needs to be divided up into 
concurrent sections. The tree traversal with neighbor look-up, representing the 
matrix-vector multiplication, is the computationally most expensive part in the 
recursion, but can easily be parallelized with different processors working on 
separate branches of the tree. Because the tree preserves locality, it offers inherent 
parallelism so that very little non-local data movement between processors is 
needed, minimizing communication bottlenecks. 

First pass of computation per recursion step to compute a[n], cf. Eq. (4)
  a[n] ← 0
  FOR i = 0 TO all nodes in the tree
    sum ← [u_n]_i × H_ii
    FOR all neighbors j of i
      retrieve value [u_n]_j by neighbor-finding algorithm
      retrieve or generate matrix element H_ij
      sum ← sum + [u_n]_j × H_ij
      IF neighbor doesn’t exist
        THEN grow new node, extend tree, add new branches as needed
    END FOR
    store sum for use in second pass
    a[n] ← a[n] + [u_n]_i × sum, cf. Eq. (4)
  END FOR

Second pass of computation per recursion step to compute b[n+1], cf. Eq. (5)
  b2 ← 0
  FOR i = 0 TO all nodes in the tree
    b2 ← b2 + ([H u_n]_i − a[n] [u_n]_i − b[n] [u_{n-1}]_i)^2, with [H u_n]_i stored from first pass
    update [u_n]_i, [u_{n-1}]_i, cf. Eq. (3)
  END FOR
  b[n+1] ← sqrt(b2), cf. Eq. (5)

Neighbor-finding algorithm, cf. Fig. 2
  level ← 1 (leaf level)
  WHILE u_i and u_j are in different branches
    go up in tree, level ← level + 1
    add coordinate of current octant to a traversal list
  END WHILE (common ancestor has been found)
  WHILE traversal list non-empty
    move down in tree, taking directions “conjugate” to the ones stored in list
    (e.g. if “south” was taken going up, “north” must be taken going down)
    remove item from traversal list
  END WHILE

Fig. 4. Pseudocode for three algorithms: first and second pass of computation, and
neighbor-finding algorithm

These facts lend themselves to an efficient implementation based on threads
in combination with a “workpile” paradigm. We have used the POSIX pthread
library, which executes multiple threads concurrently on a multi-processor
computer. In our code, the tree is partitioned into subtrees below a certain level,
and pointers to these subtrees (instances of concurrent_branch) are held by
instances of first_pass or second_pass, which are derived from the abstract base
class job. All job objects are put on a workpile. A number of worker threads take
jobs off the workpile and execute the associated operation, such as traversing
the corresponding subtree. When the workpile runs empty, the coefficients $a_n$
or $b_n$ can be computed from the accumulated results of all threads. Other job
classes exist that “prune” the tree by eliminating empty subtrees or perform
other maintenance tasks. If the assignment of subtrees to processors were fixed,
load imbalances would soon result from the dynamic growth of the tree. This
difficulty arises in distributed computing models like MPI. The latter are
therefore harder to use efficiently for this problem. The thread-based approach,
however, does require a shared memory architecture; the present program has
been developed on a Silicon Graphics Power Challenge platform.
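
The workpile idea can be sketched in a few lines. Note the substitution: the example uses the C++ standard library's std::thread and std::mutex rather than the POSIX pthread calls of the paper, and the `workpile` class, its members, and the `parallel_sum` example are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// A minimal workpile: jobs are put on the pile up front, then drained by workers.
class workpile {
public:
  void put(std::function<void()> job) { jobs_.push_back(std::move(job)); }

  // Runs all jobs on `workers` threads and returns when the pile is empty.
  void run(unsigned workers) {
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
      pool.emplace_back([this] {
        for (;;) {
          std::function<void()> job;
          {
            std::lock_guard<std::mutex> lock(mutex_);
            if (jobs_.empty()) return;
            job = std::move(jobs_.back());
            jobs_.pop_back();
          }
          job();  // executed outside the lock
        }
      });
    for (std::thread& t : pool) t.join();
  }

private:
  std::mutex mutex_;
  std::vector<std::function<void()>> jobs_;
};

// Example payload: each "subtree" is a list of values to be accumulated.
double parallel_sum(const std::vector<std::vector<double>>& subtrees, unsigned workers) {
  std::vector<double> partial(subtrees.size(), 0.0);
  workpile pile;
  for (std::size_t k = 0; k < subtrees.size(); ++k)
    pile.put([&subtrees, &partial, k] {
      for (double x : subtrees[k]) partial[k] += x;
    });
  pile.run(workers);
  double total = 0.0;  // combine the results once the pile has run empty
  for (double p : partial) total += p;
  return total;
}
```

Because idle workers simply take the next job, subtrees of very different size are balanced automatically as long as there are noticeably more jobs than threads.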

Another, rather subtle, problem arising in a shared memory multiprocessor
architecture with cache is due to “false sharing.” This is a cache line invalidation 
occurring when separate addresses contained in the same cache line are written 
to and read from by different processors. Although not a race condition, the 
penalty for that can be so severe that the program may run slower with multiple 
threads than with a single one. This problem, typical for multi-threading, is 
avoided by careful data alignment. 
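
One common form of such alignment is to give every thread's accumulator its own cache line. The sketch below assumes a line size of 64 bytes, which is hardware dependent; the names `cache_line`, `padded_sum`, and `slot_stride` are illustrative.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t cache_line = 64;  // assumed line size; check the target hardware

// Per-thread accumulator padded and aligned to a full cache line, so that
// two neighboring array elements never share a line.
struct alignas(cache_line) padded_sum {
  double value = 0.0;
};

static_assert(sizeof(padded_sum) == cache_line,
              "each accumulator should occupy exactly one cache line");

// Distance in bytes between consecutive elements of an array of T.
template <class T>
std::size_t slot_stride() {
  return sizeof(T);
}
```

An array of plain doubles places eight accumulators in one 64-byte line, whereas an array of `padded_sum` places exactly one per line, at the cost of some unused memory.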

6 Conclusions and Outlook 

We have presented a first parallel implementation of dynamic recursion in a
C++ design which will allow ready adaptation to other problem domains, including
interacting systems. Efforts are currently under way to define a clean API,
possibly within a similar framework as the POOMA or Blitz++ systems.

The program in the present form has been used to study the Anderson disorder
transition in a three-dimensional system. Other work on metal-insulator
transitions in two dimensions is currently under way. The program, however, has
a much wider range of applicability and it is hoped that in the future it may be
used for other problems as well.

Abstract. We discuss the object-oriented design of a software package
for solving sparse, symmetric systems of equations (positive definite and 
indefinite) by direct methods. At the highest layers, we decouple data 
structure classes from algorithmic classes for flexibility. We describe the 
important structural and algorithmic classes in our design, and discuss 
the trade-offs we made for high performance. The kernels at the lower 
layers were optimized by hand. Our results show no performance loss 
from our object-oriented design, while providing flexibility, ease of use, 
and extensibility over solvers using procedural design. 



1 Introduction 

The problem of solving linear systems of equations Ax = b, where the coef- 
ficient matrix is sparse and symmetric, represents the core of many scientific, 
engineering and financial applications. In our research, we investigate algorith- 
mic aspects of high performance direct solvers for sparse symmetric systems, 
focusing on parallel and out-of-core computations. Since we are interested in 
quickly prototyping our ideas and testing them, we decided to build a software 
package for such experimentation. High performance is a major design goal, in 
addition to requiring our software to be highly flexible and easy to use. 

Sparse direct solvers use sophisticated data structures and algorithms; at the 
same time, most software packages using direct solutions for sparse systems were 
written in Fortran 77. These programs are difficult to understand and difficult 
to use, modify, and extend for several reasons. First, the lack of abstract 
data types and encapsulation leads to global data structures scattered among 
software components, causing tight coupling and poor cohesion. Second, the lack 
of abstract data types and dynamic memory allocation leads to function calls 
with long argument lists, many arguments having no relevance in the context 
of the corresponding function calls. In addition, some memory may be wasted 
because all allocations are static. 

We have implemented a sparse direct solver using different programming 
languages at different layers. We have reaped the benefits of object-oriented 
design (OOD) and the support that C++ provides for OOD, at the highest 
layer, and the speed of Fortran 77 at the lower levels. The resulting code is more 
maintainable, usable, and extensible but suffers no performance penalty over a 
native Fortran 77 code. To the best of our knowledge, this work represents the 
first object-oriented design of a sparse direct solver. 

We chose C++ as a programming language since it has full support for 
object-oriented design, yet it does not enforce it. The flexibility of C++ allows 
a software designer to choose the appropriate tools for each particular software 
component. Another candidate could have been Fortran 90, but it does not have 
inheritance and polymorphism. We need inheritance in several cases outlined 
later. We also wish to derive new classes for a parallel version of our code. We 
do not want to replicate data and behavior that is common to some classes. As 
for polymorphism, there are several situations when we declare just the interfaces 
in a base class and we want to let derived classes implement a proper behavior. 

In this paper we present the design of our sequential solver. Work on a parallel 
version using the message-passing model is in progress. Object-oriented packages 
for iterative methods are described in PO]. 

2 Overview of the Problem 

Graph theory provides useful tools for computing the solution of sparse systems. 
Corresponding to a symmetric matrix A is its undirected adjacency graph G(A). 
Each vertex in the graph corresponds to a column (or row) in the matrix and 
each edge to a symmetric pair of off-diagonal nonzero entries. 

The factorization of A can be modeled as the elimination of vertices in its 
adjacency graph. The factorization adds edges to G(A), creating a new graph 
G+(A, P), where P is a permutation that describes the order in which the 
columns of A are eliminated. Edges in G+ not present in G are called fill edges, 
and they correspond to fill elements, nonzero entries in the filled matrix 
L + D + L^T that are zero in A. 
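
The elimination model can be made concrete with a small sketch. The function fill_edges and the Graph type are hypothetical names chosen for illustration, and the quadratic neighbour loop is the textbook elimination game, not the method used by the solver described here. Eliminating a vertex makes its not yet eliminated neighbours pairwise adjacent; every edge added in this way is a fill edge.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Graph = std::vector<std::set<int>>;  // adjacency sets, vertices 0..n-1

// Eliminates the vertices in the given order and returns the fill edges
// (as pairs u < v) that the elimination adds to the graph.
std::vector<std::pair<int, int>> fill_edges(Graph g, const std::vector<int>& order) {
    assert(order.size() == g.size());
    std::vector<bool> eliminated(g.size(), false);
    std::vector<std::pair<int, int>> fill;
    for (int v : order) {
        // Collect the neighbours of v that are still in the graph.
        std::vector<int> nbrs;
        for (int u : g[v])
            if (!eliminated[u]) nbrs.push_back(u);
        // They must become pairwise adjacent; new edges are fill.
        for (std::size_t a = 0; a < nbrs.size(); ++a)
            for (std::size_t b = a + 1; b < nbrs.size(); ++b)
                if (g[nbrs[a]].insert(nbrs[b]).second) {
                    g[nbrs[b]].insert(nbrs[a]);
                    fill.emplace_back(nbrs[a], nbrs[b]);
                }
        eliminated[v] = true;
    }
    return fill;
}
```

For a star graph with centre 0 and three leaves, eliminating the centre first creates three fill edges, whereas eliminating the leaves first creates none. This is the effect that the ordering step tries to exploit.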

The computation of the solution thus begins by looking for an ordering that 
reduces the fill. Several heuristic algorithms (variants of minimum degree or 
nested dissection) may be used during this step. The result is a permutation P. 

Next, an elimination forest F(A, P), a spanning forest of G+(A, P), is computed. 
The elimination forest represents the dependencies in the computation, 
and is vital in organizing the factorization step. Even though it is a spanning 
forest of the filled graph, it can be computed directly from the graph of A and 
the permutation P, without computing the filled graph. In practice, a compressed 
version of the elimination forest is employed. Vertices that share a common 
adjacency set in the filled graph are grouped together to form supernodes. Vertices 
in a supernode appear contiguously in the elimination forest, and hence a 
supernodal version of the elimination forest can be used. 
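
One well-known way to obtain the elimination forest without forming the filled graph is Liu's algorithm with path compression. The sketch below is an assumption about how this step could look, not the code of the package; the name elimination_forest is hypothetical, and the input is taken to be the graph of the already permuted matrix, with vertices numbered in elimination order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// adj[j] lists the neighbours of vertex j in the graph of the permuted
// matrix. Returns parent[j], the parent of j in the elimination forest,
// or -1 if j is a root.
std::vector<int> elimination_forest(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> parent(n, -1);
    std::vector<int> ancestor(n, -1);  // shortcut pointers (path compression)
    for (int j = 0; j < n; ++j) {
        for (int i : adj[j]) {
            assert(i >= 0 && i < n);
            if (i >= j) continue;  // only neighbours eliminated before j matter
            // Climb from i towards the root of its current subtree,
            // redirecting the shortcuts to j on the way.
            int r = i;
            while (ancestor[r] != -1 && ancestor[r] != j) {
                const int next = ancestor[r];
                ancestor[r] = j;
                r = next;
            }
            if (ancestor[r] == -1) {  // r was a root: j becomes its parent
                ancestor[r] = j;
                parent[r] = j;
            }
        }
    }
    return parent;
}
```

For a star whose centre is eliminated last, every leaf becomes a child of the centre; if the centre is eliminated first, the forest degenerates into a chain, which reflects the dependencies introduced by the fill.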

The factorization step is split in two phases: symbolic and numerical. The 
first computes the nonzero structure of the factors and the second computes the 
numerical values. The symbolic factorization can be computed efficiently using 
the supernodal elimination forest. The multifrontal method for numerical 
factorization processes the elimination forest in postorder. Corresponding to each 
supernode are two dense matrices: a frontal matrix and an update matrix. Entries 
in the original matrix and updates from the children of a supernode are assem- 
bled into the frontal matrix of a supernode, and then partial dense factorization 
is performed on the frontal matrix to compute factor entries. The factored co- 
lumns are written to the factor matrix, and the remaining columns constitute 
the update matrix that carries updates higher in the elimination forest. 

Finally, the solution is computed by a sequence of triangular and diagonal 
solves. Additional solve steps with the computed factors (iterative refinement) 
may be used to reduce the error if it is large. 
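
The sequence of triangular and diagonal solves can be illustrated on a small dense example. The sketch assumes dense storage, a plain (not block) diagonal D, and no permutation; the name solve_ldlt is hypothetical. A sparse solver performs the same three sweeps over its compressed factor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Solves (L D L^T) x = b, where L is unit lower triangular (its diagonal
// is implicit and its strict upper triangle is ignored) and d holds the
// diagonal of D. The argument x contains b on entry.
std::vector<double> solve_ldlt(const Matrix& L, const std::vector<double>& d,
                               std::vector<double> x) {
    const std::size_t n = d.size();
    assert(L.size() == n && x.size() == n);
    // Forward solve L y = b.
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < i; ++j) x[i] -= L[i][j] * x[j];
    // Diagonal solve D z = y.
    for (std::size_t i = 0; i < n; ++i) x[i] /= d[i];
    // Backward solve L^T x = z.
    for (std::size_t i = n; i-- > 0;)
        for (std::size_t j = i + 1; j < n; ++j) x[i] -= L[j][i] * x[j];
    return x;
}
```

Iterative refinement would repeat this solve with the residual b - Ax as the right hand side and add the result to x.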

When the coefficient matrix is positive definite, there is no need to pivot 
during the factorization. For indefinite matrices, pivoting is required for stability. 
Hence the permutation computed by the ordering step is modified during the 
factorization. 

Additional details about the graph model may be found in P|; about the 
multifrontal method in 0|; and about indefinite factorizations in pj. 



3 Design of the Higher Layers 

At the higher layers of our software, the goal was to make the code easy to 
understand, use, modify and extend. Different users have different needs: Some 
wish to minimize the intellectual effort required to understand the package, 
others wish to have more control. Accordingly, there must be different amounts 
of information a user has to deal with, and different levels of functionality a user 
is exposed to. 

At the highest level, a user is aware of only three entities: the coefficient 
matrix A, the right hand side vector b, and the unknown vector x. Thus a user 
could call a solver as follows: 

    x = Compute(A, b), 

expecting the solver to make the right choices. Of course it is difficult to achieve 
optimal results with such limited control, so a more experienced user would 
prefer to see more functionality. Such a user knows that the computation of the 
solution involves three main steps: (1) ordering, to preserve sparsity and thus 
to reduce work and storage requirements, (2) factorization, to decompose the 
reordered coefficient matrix into a product of factors from which the solution 
can be computed easily, and (3) solve, to compute the solution from the factors. 
This user would then like to perform something like this: 



    P = Order(A), 
    (L, D, P) = Factor(A, P), 
    x = Solve(L, D, P, b). 

Here, P is a permutation matrix that trades sparsity for stability, L is a unit 
lower triangular or block unit lower triangular matrix, and D is a diagonal or 
block diagonal matrix. 

At this level the user has enough control to experiment with different algo- 
rithms for each one of these steps. The user could choose a minimum degree or a 
nested dissection ordering, a left-looking or a multifrontal factorization. In addi- 
tion, the user may choose to run some of the steps more than once to solve many 
related systems of equations, or for iterative refinement to reduce the error. 

We organized the higher layers of our software as a collection of classes that 
belong to one inheritance tree. At the root of the tree we put the Object class, 
which handles errors and provides a debugging interface. Then, since the two 
basic software components are data structures and algorithms, and since decou- 
pling them achieves flexibility, we derived a DataStructure class and an Algorithm 
class from Object. The first one handles general information about all structural 
objects and the second one deals with the execution of all algorithmic objects. 
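
The shape of this inheritance tree can be sketched as follows. Only the class names Object, DataStructure, and Algorithm and the methods run and getSize appear in the paper; all other members, and the concrete class SumOfIndices, are hypothetical and serve only to make the sketch self-contained.

```cpp
#include <cassert>
#include <string>

// Root of the tree: error handling shared by every class.
class Object {
public:
    virtual ~Object() {}
    bool ok() const { return error_.empty(); }
    const std::string& error() const { return error_; }
protected:
    void setError(const std::string& message) { error_ = message; }
private:
    std::string error_;
};

// Base of all structural classes: general information such as the size.
class DataStructure : public Object {
public:
    explicit DataStructure(int size) : size_(size) {
        if (size < 0) setError("negative size");
    }
    int getSize() const { return size_; }
private:
    int size_;
};

// Base of all algorithmic classes: run() controls the execution and
// delegates the actual work to the derived class.
class Algorithm : public Object {
public:
    bool run() {
        if (!ok()) return false;
        execute();
        return ok();
    }
protected:
    virtual void execute() = 0;
};

// Hypothetical concrete algorithm, for illustration only.
class SumOfIndices : public Algorithm {
public:
    explicit SumOfIndices(const DataStructure& data) : data_(data), total_(0) {
        if (!data.ok()) setError("invalid input: " + data.error());
    }
    int total() const { return total_; }
protected:
    void execute() override {
        assert(ok());
        total_ = 0;
        for (int i = 1; i <= data_.getSize(); ++i) total_ += i;
    }
private:
    const DataStructure& data_;
    int total_;
};
```

An algorithmic object is bound to its structural operands at construction and does its work when run() is called, which is the calling pattern used by the solver in Fig. 3.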

An important observation is necessary here. While full decoupling needs per- 
fect encapsulation, the overhead introduced by some interfaces may be too high. 
Thus performance reasons forced us to weaken the encapsulation allowing more 
knowledge about several objects. For sparse matrices, for example, we store the 
data (indices and values) column-wise, in a set of arrays. We allow other ob- 
jects to retrieve these arrays, making them aware of the internal representation 
of a sparse matrix. We protect the data from being corrupted by providing 
non-const access only to functions that need to change the data. Such a design 
implementation may be unacceptable for an object-oriented purist. However, a 
little discipline from the user in accessing such objects is not a high price for a 
significant gain in performance. 
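
A minimal sketch of such weakened encapsulation is given below. The class ColumnStorage, its accessors, and the two kernels are hypothetical; the sketch only assumes what the text states, namely column-wise storage in a set of arrays that other objects may retrieve, with writable access restricted to code that holds a non-const object.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class ColumnStorage {
public:
    ColumnStorage(std::vector<int> colPtr, std::vector<int> rowIdx,
                  std::vector<double> values)
        : colPtr_(std::move(colPtr)), rowIdx_(std::move(rowIdx)),
          values_(std::move(values)) {
        assert(!colPtr_.empty());
        assert(rowIdx_.size() == values_.size());
        assert(static_cast<std::size_t>(colPtr_.back()) == values_.size());
    }
    int getSize() const { return static_cast<int>(colPtr_.size()) - 1; }

    // Read-only views of the internal arrays, for kernels that only read.
    const int* getColPtr() const { return colPtr_.data(); }
    const int* getRowIdx() const { return rowIdx_.data(); }
    const double* getValues() const { return values_.data(); }

    // Writable view, reachable only through a non-const matrix.
    double* getValues() { return values_.data(); }

private:
    std::vector<int> colPtr_;  // column j occupies [colPtr_[j], colPtr_[j+1])
    std::vector<int> rowIdx_;
    std::vector<double> values_;
};

// Reading kernel: works directly on the raw arrays of a const matrix.
double column_sum(const ColumnStorage& a, int j) {
    const int* ptr = a.getColPtr();
    const double* val = a.getValues();
    double sum = 0.0;
    for (int k = ptr[j]; k < ptr[j + 1]; ++k) sum += val[k];
    return sum;
}

// Writing kernel: needs a non-const matrix to obtain the writable view.
void scale(ColumnStorage& a, double factor) {
    const int nnz = a.getColPtr()[a.getSize()];
    double* val = a.getValues();
    for (int k = 0; k < nnz; ++k) val[k] *= factor;
}
```

The kernels run at the speed of plain array code, while the compiler still rejects any attempt to modify the values through a const reference.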

A user who does not want to go beyond the high level of functionality of the 
main steps required to compute the solution sees the following structural classes: 
SparseSymmMatrix, Vector, Permutation and SparseLwTrMatrix. The first class 
describes coefficient matrices, the second right hand side and solution vectors, 
the third permutations, and the fourth both triangular and diagonal factors. 
We decided to couple these last two because they are always accessed together 
and a tight coupling between them leads to higher performance without any 
significant loss in understanding the code. The derivation of these four classes 
from DataStructure is shown in Fig. 1. 

Fig. 1. High level structural classes 

Fig. 2. Some high level algorithmic classes 



At the same level the user also sees several algorithmic classes. First there are 
various ordering algorithms, such as NestDissOrder or MultMinDegOrder. Then 
there are factorization algorithms, like PosDefLeftLookFactor, PosDefMultFrtFactor 
or IndefMultFrtFactor. Finally, the solve step can be performed by the PosDefSolve 
or IndefSolve algorithms. Figure 2 describes the derivation of some of 
these classes from Algorithm. Using them one can easily write a solver (positive 
definite, for concreteness), as shown in Fig. 3. 

More details are available beyond this level of functionality. The factoriza- 
tion is split in two phases: symbolic and numerical. The symbolic factorization 
is guided by an elimination forest. The multifrontal method for numerical facto- 
rization uses an update stack and several frontal and update matrices, which are 
dense and symmetric. Pivoting strategies for indefinite systems can be controlled 
at the level of frontal and update matrices during the numerical factorization 
phase. Figures 4 and 5 depict the derivation of the corresponding structural and 
algorithmic classes. 

Classes such as SparseSymmMatrix, SparseLwTrMatrix, and Permutation are 
implemented with multiple arrays of differing sizes. Several of these are arrays 
of indices that index into the other arrays, so that the validity of the state of 
a class depends on not only the individual internal arrays, but the interaction 
between several of them. 

In a conventional sparse solver, these arrays are global and some of them 
are declared in different modules. A coefficient matrix, a factor, a permutation, 
or an elimination forest is not a well defined entity but the sum of scattered 
data. This inhibits software maintenance because of the tight coupling between 
disparate compilation units. 

There are also significant benefits in terms of type safety. For instance, a 
permutation is often represented as an array of integers. It could be that the 
index of the old number holds the new position or vice versa. We use oldToNew 
and newToOld to refer to the two arrays. The problem is that interpreting a 
newToOld permutation as an oldToNew permutation yields a valid operation, 
though an incorrect permutation. It is easy for users to reverse these two, par- 
ticularly when the names “permutation” and “inverse permutation” are applied 
since there is no agreement on whether newToOld is the former or the latter. 
Our Permutation class maintains both arrays internally and supplies each on 
demand. 
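
A minimal sketch of a permutation class that keeps both directions consistent is shown below. The names oldToNew and newToOld follow the text; the constructor, setOldToNew, and the helper permute are hypothetical and are not the interface of the package.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class Permutation {
public:
    // Starts as the identity permutation of the given size.
    explicit Permutation(int size) : oldToNew_(size), newToOld_(size) {
        for (int i = 0; i < size; ++i) oldToNew_[i] = newToOld_[i] = i;
    }
    int getSize() const { return static_cast<int>(oldToNew_.size()); }

    // Installs a new ordering given as an old-to-new map and derives the
    // inverse. Returns false, leaving the object unchanged, if the
    // argument is not a permutation of 0..size-1.
    bool setOldToNew(const std::vector<int>& map) {
        const int n = getSize();
        if (static_cast<int>(map.size()) != n) return false;
        std::vector<int> inverse(n, -1);
        for (int oldPos = 0; oldPos < n; ++oldPos) {
            const int newPos = map[oldPos];
            if (newPos < 0 || newPos >= n || inverse[newPos] != -1) return false;
            inverse[newPos] = oldPos;
        }
        oldToNew_ = map;
        newToOld_ = inverse;
        return true;
    }

    const std::vector<int>& oldToNew() const { return oldToNew_; }
    const std::vector<int>& newToOld() const { return newToOld_; }

private:
    std::vector<int> oldToNew_;
    std::vector<int> newToOld_;
};

// Moves the entry at each old position to its new position.
std::vector<double> permute(const Permutation& p, const std::vector<double>& x) {
    assert(x.size() == p.oldToNew().size());
    std::vector<double> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[p.oldToNew()[i]] = x[i];
    return y;
}
```

Because callers must ask for oldToNew() or newToOld() by name, the direction of the map is stated at every point of use instead of being implied by a bare integer array.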

int main() 
{ 
    /* Load the coefficient matrix and the right hand side vector. */ 
    SparseSymmMatrix a("a.mat"); 
    Vector b("b.vec"); 

    /* Reorder the matrix to reduce fill. */ 
    Permutation p(a.getSize()); 
    MultMinDegOrder order(a, p); 
    order.run(); 

    /* Factor the reordered matrix. */ 
    SparseLwTrMatrix l(a.getSize()); 
    PosDefMultFrtFactor factor(a, p, l); 
    factor.run(); 

    /* Compute the solution. */ 
    Vector x(a.getSize()); 
    PosDefSolve solve(l, p, b, x); 
    solve.run(); 

    /* Save the solution. */ 
    x.save("x.vec"); 
} 

Fig. 3. A direct solver for sparse, symmetric positive definite problems at the highest 
level 



4 Design of the Lower Layers 

While the larger part of our code deals with the design of the higher layers, 
most of the CPU time is actually spent in few computationally intensive loops. 
No advanced software paradigms are needed at this level so we concentrated on 
performance by carefully implementing these loops. 

A major problem with C++ (also with C) is pointer aliasing, which makes 
code optimization more difficult for a compiler. We get around this problem 
by making local copies of simple variables in our kernel code. Another source 
of performance loss is complex numbers, since they are not a built-in data type 
in C++ as they are in Fortran. There is a template complex class in the Standard 
C++ library. Though this gives the compiler enough information to enforce all 
the rules as if it were a built-in datatype, it does not (indeed cannot) give the 
compiler any information about how to optimize for this class as if it were a 
built-in datatype. 
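
The local-copy technique can be sketched on a small scatter kernel. The struct IndexList and the function mark_rows are hypothetical and are not taken from the package. Without the local copy of the count, the compiler must assume that a store into the int array marker could change list.count, and it has to reload the loop bound in every iteration.

```cpp
#include <cassert>

// A list of row indices, as it might be kept inside a larger object.
struct IndexList {
    int count;
    const int* idx;
};

// Writes tag into marker at every listed row.
void mark_rows(const IndexList& list, int tag, int* marker) {
    assert(marker != nullptr);
    // Local copies: the loop bound and the index pointer now live in
    // variables whose address is never taken, so the stores to marker
    // below cannot alias them.
    const int n = list.count;
    const int* const idx = list.idx;
    for (int k = 0; k < n; ++k) marker[idx[k]] = tag;
}
```

The copy costs two loads before the loop and lets the compiler keep both values in registers for the whole loop.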

We implemented our computationally intensive kernels both in C++ and 
Fortran 77. A choice between these kernels and between real and complex 
arithmetic can be made using compile-time switches. We defined our own class for 
complex numbers but we make minimal use of complex arithmetic operators, 
which are overloaded. The bulk of the computation is performed either in C++ 
kernels written in C-like style or in Fortran 77 kernels. Currently, we obtain 
better results with the Fortran 77 kernels. 

Fig. 4. Structural classes used by the multifrontal numerical factorization algorithms 

Fig. 5. Some symbolic and numerical factorization algorithmic classes 

5 Results 

We report results obtained on a 66 MHz IBM RS/6000 machine with 256 MB 
main memory, 128 KB L1 data cache and 2 MB L2 cache, running AIX 4.2. 
Since this machine has two floating point functional units, each one capable of 
issuing one fused multiply-add instruction every cycle, its peak performance is 
theoretically 266 Mflop/s. We used the Fortran 77 kernels and we compiled the 
code with xlC 3.1.4 (-O3 -qarch=pwr2) and xlf 5.1 (-O4 -qarch=pwr2). 

We show results for three types of problems: two-dimensional nine-point 
grids, Helmholtz problems, and Stokes problems, using multiple minimum 
degree ordering and multifrontal factorization. We use the following notation: n 
is the number of vertices in G(A) (this is the order of the matrix), m is the 
number of edges in G(A), and m+ is the number of edges in G+(A, P), the 
filled graph. The difference between m+ and m represents the fill. In Table 1 
we describe each problem using these three numbers and we also provide the 
CPU time and the performance for the numerical factorization step, generally the 
most expensive step of the computation. Higher performance is obtained for the 
Helmholtz problems because complex arithmetic leads to better use of registers 
and caches than real arithmetic. We achieved performance comparable to other 
solvers, written completely in Fortran 77. Hence there is no performance penalty 
due to the object-oriented design of our solver. 



Table 1. Performance on an IBM RS/6000 for three sets of problems from fluid 
dynamics and acoustics. The cputimes (in seconds) and performance for the numerical 
factorization step are reported 

Problem            n         m          m+     time   Mflop/s 

grid9.63       3,969    15,500     104,630     0.77      34.2 
grid9.127     16,129    63,756     552,871     1.70      41.6 
grid9.255     65,025   258,572   2,717,313    10.89      47.4 

helmholtz0     4,224    24,512     130,500     0.77      62.3 
helmholtz1    16,640    98,176     639,364     4.72      77.8 
helmholtz2    66,048   392,960   3,043,076    30.88      90.8 

e20r0000       4,241    64,185     369,843     1.70      35.8 
e30r0000       9,661   149,416   1,133,759     6.56      40.2 
e40r0000      17,281   270,367   2,451,480    17.77      43.6 



We are currently implementing the solver in parallel using the message- 
passing paradigm. We plan to derive new classes to deal with the parallelism. 
Consider the FrontalMatrix class, which stores the global indices in the index array 
and the numerical values in the value array. A ParFrontalMatrix class would 
need to add a processor array to store the owner of each column. A ParUpdateMatrix 
class may be derived in a similar way from UpdateMatrix. Some parallel 
algorithmic classes would be needed as well. 
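
The planned derivation might look like the following sketch. Only the class names and the index, value, and processor arrays come from the text; the accessors, the packed storage of the values, and the cyclic assignment of columns to processors are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class FrontalMatrix {
public:
    explicit FrontalMatrix(const std::vector<int>& globalIndices)
        : index_(globalIndices),
          value_(globalIndices.size() * (globalIndices.size() + 1) / 2, 0.0) {}
    virtual ~FrontalMatrix() {}
    int getSize() const { return static_cast<int>(index_.size()); }
    const std::vector<int>& getIndex() const { return index_; }
    const std::vector<double>& getValue() const { return value_; }
protected:
    std::vector<int> index_;     // global indices of the rows and columns
    std::vector<double> value_;  // packed lower triangle (assumed layout)
};

// Inherits the data and behavior of FrontalMatrix and adds only what the
// parallel version needs: the owner of each column.
class ParFrontalMatrix : public FrontalMatrix {
public:
    ParFrontalMatrix(const std::vector<int>& globalIndices, int numProcessors)
        : FrontalMatrix(globalIndices), processor_(globalIndices.size()) {
        assert(numProcessors > 0);
        // Cyclic distribution of columns; an assumed policy.
        for (std::size_t j = 0; j < processor_.size(); ++j)
            processor_[j] = static_cast<int>(j % numProcessors);
    }
    int getOwner(int column) const { return processor_[column]; }
private:
    std::vector<int> processor_;  // owner of each column
};
```

Nothing in the sequential class has to be replicated, which is the reason given earlier for requiring inheritance.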



Abstract. We propose Janus, a C++ template library of container classes 
and communication primitives for parallel dynamic mesh applications. 
The paper focuses on two-phase containers that are a central component 
of the Janus framework. These containers are quasi-constant, 
i.e., they have an extended initialization phase after which they provide 
read-only access to their elements. Two-phase containers are useful for 
the efficient and easy-to-use representation of finite element meshes and 
generating sparse matrices. Using such containers makes it easy to encap- 
sulate irregular communication patterns that occur when running finite 
element programs in parallel. 



1 Introduction 

If we think of a finite element program as a collection of related objects on which 
operations are performed, we recognize that there are two types of application 
objects. The first are sets that represent spatial structures, and the second are 
numerical functions on these sets. Here are some examples of spatial structures 
and functions on them. 

- Given the node set N, many physically relevant data are represented by 
  functions $f : N \to \mathbb{R}$. 

- The element matrices are a function $e : T \to \mathbb{R}^{p \times p}$ from the 
  triangulation T, where p denotes the number of degrees of freedom per element. 

- The system matrix m is a function $m : M \to \mathbb{R}$, where M is a subset of 
  $N \times N$ that represents the sparsity pattern of the matrix. 

The parallelism of finite element methods is mainly data parallelism with 
respect to the meshes. Using parallel computers with a distributed memory 
architecture therefore requires a partition of the triangulation T and the node set 
N. Whatever communication occurs when running a finite element program in 
parallel is caused by a relation between the meshes. One difficulty is that, because 
of their irregularity and size, the relations themselves must be partitioned. 

The driving motivation behind the design of the Janus framework is to provide 
application-oriented, easy-to-use, efficient abstractions for the fundamental 
components of finite element methods mentioned above. Janus offers building 
blocks to represent (possibly partitioned) spatial structures and functions on 
them. Moreover, communication must be expressed explicitly based on the mesh 
relations. 

Janus is implemented as a C++ template library. The library is generic with 
respect to the numerical types, the way the user wants to represent mesh points, 
and the mapping information that is used for partitioning the meshes. 

A fundamental concept in Janus is that of two-phase containers which are 
used to represent spatial structures with a non-trivial initialization phase. The 
lifetime of a two-phase container is separated into a generation phase and an ac- 
cess phase. The transition from one to the other phase is marked by a call to its 
freeze method. Such a concept is useful for the implementation of finite element 
methods, for three reasons. First, an adaptive finite element method can be con- 
sidered as a succession of generation and computation phases (cf. P^)- The same 
holds for the underlying patterns of sparse matrices. Second, the necessary 
initializations are usually too complex to be performed in one call of a C++ 
constructor. Extending the initialization phase helps to make the change of data 
structures transparent to the application programmer. Third, communication that might 
occur while generating distributed meshes can be delayed until the call of the 
freeze method. 

Another important concept is that of an associated container that is used for 
the representation of numerical functions on (distributed) spatial structures. 

An overview of these containers and their use in a sequential context is presented 
in Sect. 2. Aspects of the parallel implementation and optimizations that 
are enabled by the use of two-phase containers are discussed in Sect. 3. We 
explain how the use of two-phase containers allows one to analyze irregular 
communication patterns as they occur in finite element sparse matrices. This is an 
important optimization for the iterative solution of parallel finite element problems. 



2 Concepts and Classes in Janus 

Both from a conceptual and an implementation point of view, Janus is based on 
the containers and algorithms of the Standard C++ Library j2j, also known as 
the Standard Template Library (STL) 0). The STL is not only a collection of 
fundamental data structures, generic classes, and algorithms. It defines concepts, 
i.e., generic sets of type requirements, and its container classes are models of 
these concepts, i.e., they are types that satisfy these requirements. The idea is 
that: “Using concepts makes it possible to write programs that cleanly separate 
interface from implementation” 0. 



2.1 Two-Phase Containers 

A two-phase container is a variable-sized container that supports insertion of 
elements. However, all insertions must have been finished before any element of 
the container can be accessed. Only non-mutating access is allowed. 

The phase in which insert operations are allowed is called the generation 
phase or first phase. The phase in which access is allowed is called access phase 
or second phase. The transition from the first to the second phase is marked by 
a call to the void freeze () method of a two-phase container. 

Containers that follow these requirements represent application objects that 
have a non-trivial yet clearly distinguishable initialization procedure. Typical 
examples are finite element meshes or sparse matrix patterns whose structure is 
not known at compile time. 

A two-phase container can be “frozen” only once and it has no thaw method 
that would allow new insertions. This means that a two-phase container cannot 
be used to implement meshes that are meant to be modified after their 
initialization. However, this is only an apparent restriction, since from a conceptual 
point of view it is often easier to represent mesh modification by creating a new 
mesh out of an existing one (cf. |I|). 
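
The two-phase concept can be sketched in a few lines. The class TwoPhaseSet below is a hypothetical stand-in, not the Janus implementation: it stages insertions in a std::set, and its freeze method copies the elements once into a std::vector that provides read-only random access.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <vector>

template <class T, class Compare = std::less<T> >
class TwoPhaseSet {
public:
    TwoPhaseSet() : frozen_(false) {}

    // Generation phase: duplicates are silently ignored.
    void insert(const T& element) {
        assert(!frozen_);
        staging_.insert(element);
    }

    // One-way transition to the access phase; there is no thaw.
    void freeze() {
        assert(!frozen_);
        elements_.assign(staging_.begin(), staging_.end());
        staging_.clear();
        frozen_ = true;
    }

    bool frozen() const { return frozen_; }

    // Access phase: read-only random access in sorted order.
    std::size_t size() const {
        assert(frozen_);
        return elements_.size();
    }
    const T& operator[](std::size_t i) const {
        assert(frozen_ && i < elements_.size());
        return elements_[i];
    }

    // Position of element, or size() if it is not contained.
    std::size_t position(const T& element) const {
        assert(frozen_);
        typename std::vector<T>::const_iterator it = std::lower_bound(
            elements_.begin(), elements_.end(), element, Compare());
        if (it == elements_.end() || Compare()(element, *it)) return size();
        return static_cast<std::size_t>(it - elements_.begin());
    }

private:
    bool frozen_;
    std::set<T, Compare> staging_;  // used only during the generation phase
    std::vector<T> elements_;       // used only during the access phase
};
```

Once frozen, every element has a fixed position between 0 and size() - 1, which is what allows dense arrays of values to be attached to the set.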



The FixedSet Container Family. This family provides two template classes that 
are models of the two-phase container concept, namely the OrderedFixedSet 
and the HashedFixedSet templates. Elements of each container type must be 
unique. 

The main difference between these two classes is that the OrderedFixedSet 
uses the STL set template class to initially store its data, whereas the other is 
implemented by the STL hash_set container. Both containers provide read-only 
random access to their elements. This is very natural, since in the second phase 
(when the container is frozen), it is no problem to number its elements from 0 
to size() - 1. This property of two-phase containers can be exploited for a very 
efficient implementation of vector classes for finite element methods, which is 
explained in Sect. 2.2. 

Note that the actual details of the representation of the sets (red-black tree 
or hash table) are hidden from the user. It is very easy to switch between both 
implementation strategies or even to mix them. This is in contrast to implemen- 
tation strategies that expose such low-level details to the application program- 
mer 0. 



Use of Two-Phase Containers. The code fragment in Fig. 1 shows a typical 
use of a two-phase container. Given the triangles of a finite element mesh, its 
nodes (in this particular case the vertices of the triangles) will be generated. We 
use a six-tuple of integers to denote triangles and their nodes. This allows us to 
express the triangle- vertex relation by simple index arithmetic jSl E| • To get the 
vertices of a triangle on a certain level of an adaptively refined mesh the short 
inline function vertices must be called. This is done for each triangle and the 
resulting nodes are inserted in the node set nodes. Even if the same node is 
inserted several times the implementation of the container assures that it occurs 
only once. 

After all vertices have been inserted, the nodes container is frozen. In the case of 
an OrderedFixedSet container, its elements are copied from the STL set container 
that was used during the initialization phase to a dynamically allocated fixed-size 
array represented by the STL vector container. 

typedef OrderedFixedSet<Index<6>, less<Index<6> > > Triangles; 
typedef OrderedFixedSet<Index<6>, less<Index<6> > > Nodes; 

Nodes create_nodes(int level, const Triangles& triangles) { 
    Nodes nodes; 
    Triangles::const_iterator i; 
    for (i = triangles.begin(); i != triangles.end(); i++) { 
        Tuple<Index<6>,3> v = vertices(*i, level); 
        for (size_t j = 1; j <= 3; j++) nodes.insert(v[j]); 
    } 
    nodes.freeze(); 
    return nodes; 
} 

Fig. 1. Using a two-phase container for the representation of the nodes of a finite 
element mesh 



The FixedRelation Container Family. This family consists of two-phase 
containers to represent relations between two sets. Therefore they store pairs of 
elements of other sets. Except for some additional methods and type information 
about the underlying sets, the interface of these classes is the same as for 
FixedSet containers. 

There is a special member of this family called IndexedFixedRelation. 
When freeze() is called, the positions of the components of its pairs with respect 
to the underlying sets are determined. It is shown in Sect. 2.3 how this can 
be used for the efficient implementation of sequential finite element methods. 



2.2 Associated Containers 

Associated containers are primarily used for the efficient representation of nu- 
merical functions on sets represented by two-phase containers. An associated 
container is, by definition, a random-access container whose size is determined 
by that of another container that represents the underlying set. When an 
associated container is initialized it gets a reference to its underlying set object, 
which must be a fixed-size container or a frozen two-phase container. This allows 
for efficient storage of the elements, for example the STL valarray<T> could be 
used. Elements of associated containers can be accessed by random access or by 
access through elements of the underlying set (the at method). 

The SetArray class template is Janus’ standard model of an associated con- 
tainer. It offers no direct support for numerical operations. These services are 
provided by the template classes SetVector and SetMatrix, which are wrappers 
around SetArray. The main difference between the two is that 
SetMatrix requires that the underlying set is a member of the FixedRelation 
container family. 
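
A minimal sketch of an associated container follows. The class AssociatedArray is hypothetical and much simpler than SetArray: the underlying set is assumed to be a sorted std::vector that stands in for a frozen two-phase container, and the member names size, set, and at follow the usage described in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

template <class Key, class T>
class AssociatedArray {
public:
    // The underlying set must outlive the array and must not change.
    explicit AssociatedArray(const std::vector<Key>& sortedSet)
        : set_(sortedSet), data_(sortedSet.size(), T()) {
        assert(std::is_sorted(set_.begin(), set_.end()));
    }

    std::size_t size() const { return data_.size(); }
    const std::vector<Key>& set() const { return set_; }

    // Random access by position in the underlying set.
    T& operator[](std::size_t i) { return data_[i]; }
    const T& operator[](std::size_t i) const { return data_[i]; }

    // Access through an element of the underlying set; costs a search.
    T& at(const Key& key) { return data_[positionOf(key)]; }
    const T& at(const Key& key) const { return data_[positionOf(key)]; }

private:
    std::size_t positionOf(const Key& key) const {
        typename std::vector<Key>::const_iterator it =
            std::lower_bound(set_.begin(), set_.end(), key);
        assert(it != set_.end() && !(key < *it));
        return static_cast<std::size_t>(it - set_.begin());
    }

    const std::vector<Key>& set_;  // reference to the underlying set
    std::vector<T> data_;          // one value per element of the set
};
```

Because the size is fixed by the underlying set, the values sit in one contiguous block; the price of at is a logarithmic search per call.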



2.3 Interaction of Two-Phase and Associated Containers 



For each triangle the average value of a grid function u on the index set nodes 
will be computed and stored in a grid function x on the index set triangles. To 
determine the vertices of a triangle we use the vertices method again. Note that 
we iterate over the triangles through random access to x.set(), which returns a 
reference to triangles. 

void average(int level, const SetArray<Nodes,double>& u, 
             SetArray<Triangles,double>& x) { 
    for (size_t i = 0; i < x.size(); i++) { 
        Index<6> triangle = x.set()[i]; 
        Tuple<Index<6>,3> v = vertices(triangle, level); 
        x[i] = (u.at(v[1]) + u.at(v[2]) + u.at(v[3])) / 3.0; 
    } 
} 

Fig. 2. Implementation of the function average 



The implementation shown in Fig. 2 looks quite appealing, but there are 
two problems with this usage of the at method. First, the at method will not 
work in the parallel case because the data that it tries to access may reside in 
another computational domain, and Janus does not support (for performance 
reasons) remote accesses to individual elements. Second, the overhead may be 
large even in the sequential case since a call of at causes a non-trivial search in 
the underlying set. 

A solution to both problems is to compute in advance the relation between 
the triangles and their vertices and to store it in a variable of type 
Tuple<Triangles_Nodes,3>. That is, we consider the triangle-vertex 
relation as three separate relations. 

Note that in the example in Fig. 3 the template IndexedFixedRelation 
(mentioned in Sect. 2.1) is used. The precalculated positions can be accessed 
through the methods index1(size_t) and index2(size_t). This means that in 
the sequential case the average procedure can be implemented as in Fig. 4. 

Note that this use of the precalculated indices is nothing more than the 
traditional “index arrays” that are typically used in Fortran programs. In Janus 
these helper objects are set up when the container that holds the relation is frozen. 
This computation is therefore transparent to the user. 
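
The index array idea itself can be shown without any Janus classes. In the sketch below, which uses only standard containers and the hypothetical names build_index and average, the positions of all triangle vertices in a sorted node set are looked up once, and every later sweep uses plain indexed loads.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::array<int, 3> Triangle;           // keys of the three vertices
typedef std::array<std::size_t, 3> Positions;  // their positions in the node set

// Done once: look up every vertex of every triangle in the sorted node set.
std::vector<Positions> build_index(const std::vector<Triangle>& triangles,
                                   const std::vector<int>& sortedNodes) {
    std::vector<Positions> index(triangles.size());
    for (std::size_t t = 0; t < triangles.size(); ++t)
        for (std::size_t j = 0; j < 3; ++j) {
            std::vector<int>::const_iterator it = std::lower_bound(
                sortedNodes.begin(), sortedNodes.end(), triangles[t][j]);
            assert(it != sortedNodes.end() && *it == triangles[t][j]);
            index[t][j] = static_cast<std::size_t>(it - sortedNodes.begin());
        }
    return index;
}

// Done in every sweep: indexed loads only, no searching.
void average(const std::vector<Positions>& index, const std::vector<double>& u,
             std::vector<double>& x) {
    assert(x.size() == index.size());
    for (std::size_t t = 0; t < index.size(); ++t)
        x[t] = (u[index[t][0]] + u[index[t][1]] + u[index[t][2]]) / 3.0;
}
```

The search cost is paid once per mesh instead of once per access, which matters when the averaging is repeated many times on the same mesh.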

typedef IndexedFixedRelation<Triangles,Nodes> Triangles_Nodes; 

Tuple<Triangles_Nodes,3> 
triangle_vertex(const Triangles& t, const Nodes& n, int level) { 
    Tuple<Triangles_Nodes,3> result(Triangles_Nodes(t, n)); 
    for (Triangles::const_iterator i = t.begin(); i != t.end(); i++) { 
        Index<6> triangle = *i; 
        Tuple<Index<6>,3> node = vertices(triangle, level); 
        for (size_t j = 1; j <= 3; j++) 
            result[j].insert(make_pair(triangle, node[j])); 
    } 
    for (size_t j = 1; j <= 3; j++) result[j].freeze(); 
    return result; 
} 

Fig. 3. Creation of the triangle vertex relation 



void average(const SetArray<Nodes, double>& u,
             const Tuple<Triangles_Nodes, 3>& r,
             SetArray<Triangles, double>& x)
{
    for (size_t i = 0; i < x.size(); i++)
        x[i] = (u[r[1].index2(i)] + u[r[2].index2(i)] +
                u[r[3].index2(i)]) / 3.0;
}



Fig. 4. Revised sequential implementation of the function average. 
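The index-array idea behind Fig. 4 can be illustrated without Janus. The
following is a minimal sketch in standard C++, not the library's implementation:
the type VertexIndex and the use of std::vector are assumptions that stand in
for the frozen relation and for SetArray.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the frozen relation: for every triangle, the
// positions of its three vertices in the node array (a classic index array).
using VertexIndex = std::vector<std::array<std::size_t, 3>>;

// Average the nodal values u over the three vertices of each triangle.
std::vector<double> average(const std::vector<double>& u,
                            const VertexIndex& vertex_of) {
    std::vector<double> x(vertex_of.size());
    for (std::size_t t = 0; t < vertex_of.size(); ++t) {
        const std::array<std::size_t, 3>& v = vertex_of[t];
        assert(v[0] < u.size() && v[1] < u.size() && v[2] < u.size());
        // Direct array lookups replace the search that a call of at would do.
        x[t] = (u[v[0]] + u[v[1]] + u[v[2]]) / 3.0;
    }
    return x;
}
```

Because the positions are computed once, each evaluation of average costs three
array lookups per triangle and no search.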



3 Parallel Environment 



With respect to a parallel implementation the programmer should have an
SPMD programming model in mind. The library supports expressing data
parallelism on the level of meshes. This requires first of all that programmers
have a good model to represent mesh nodes and elements. We advocate representing
meshes by so-called index spaces, that is, sets of integer tuples.

The great advantage of our indexing technique is that it provides application- 
oriented global names that are independent from implementation details. This 
allows expression of communication relations independent from the mapping of 
the indices onto the underlying hardware architecture. The approach of using 
integer tuples to place and retrieve data recalls the concept of tuple spaces
in Linda. However, in Janus these integer tuples are stored in two-phase
containers whose access semantics are modeled on the usage cycle of finite
element meshes. This allows locally fast random access to the data.
3.1 Mapped Containers

In a parallel and distributed environment the finite element meshes must be 
distributed over a group of abstract processes, which are called domains in Janus. 
As in MPI, these processes are denoted by integers.

To represent distributed meshes in Janus the programmer uses mapped (two- 
phase) containers. Mapped containers have an additional template parameter 
that serves as a mapping type. As a mapping type any class that has a method 
domain can be used that assigns an integer to its argument. The mapped contai- 
ner uses the mapping type to decide to which domain an object that is inserted 
will be mapped. The idea of using mapping type template parameters has been
taken from the runtime library of the PROMOTER programming model.
It gives the user greater flexibility in choosing appropriate mapping strategies. 

If an object is inserted into a mapped container then the mapping type is 
taken to check to which domain the object belongs. If the domain is the same as 
the one of the mapped container then it is inserted locally. Otherwise, it is put in 
a temporary buffer. When calling the freeze methods of the mapped container, 
the temporary buffers are sent to the appropriate domains where the objects 
are inserted. Delaying the communication is possible since elements are accessed 
only after the freeze method has been called. 
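The insert-then-freeze protocol described above can be sketched as follows.
This is an illustration under stated assumptions, not Janus code: the names
MappedSet and BlockMapping are hypothetical, keys are plain integers, and the
message exchange between domains is simulated inside one address space by a
static freeze function that plays the role of the collective freeze call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical mapping type: any class with a domain() method would do.
struct BlockMapping {
    int block;  // number of consecutive keys owned by each domain
    int domain(int key) const { return key / block; }
};

// Sketch of a mapped two-phase container as seen from one domain.
template <class Mapping>
class MappedSet {
public:
    MappedSet(int my_domain, Mapping m) : me_(my_domain), map_(m), frozen_(false) {}

    // Phase 1: local keys are stored, remote keys are only buffered.
    void insert(int key) {
        assert(!frozen_);
        int d = map_.domain(key);
        if (d == me_) local_.push_back(key);
        else outgoing_[d].push_back(key);
    }

    // Phase 2: read-only access to the locally stored keys.
    bool contains(int key) const {
        assert(frozen_);
        return std::binary_search(local_.begin(), local_.end(), key);
    }

    std::size_t size() const { return local_.size(); }

    // Simulated collective freeze: deliver every buffer to its owner,
    // then sort the local data and switch all domains to the read phase.
    static void freeze(std::vector<MappedSet>& domains) {
        for (MappedSet& src : domains) {
            for (auto& buf : src.outgoing_) {
                assert(buf.first >= 0 &&
                       static_cast<std::size_t>(buf.first) < domains.size());
                std::vector<int>& dst = domains[buf.first].local_;
                dst.insert(dst.end(), buf.second.begin(), buf.second.end());
            }
        }
        for (MappedSet& s : domains) {
            s.outgoing_.clear();
            std::sort(s.local_.begin(), s.local_.end());
            s.local_.erase(std::unique(s.local_.begin(), s.local_.end()),
                           s.local_.end());
            s.frozen_ = true;
        }
    }

private:
    int me_;
    Mapping map_;
    bool frozen_;
    std::vector<int> local_;
    std::map<int, std::vector<int>> outgoing_;
};
```

With a block size of 10, inserting key 12 on domain 0 only appends it to the
buffer for domain 1; the key becomes visible there once freeze has run.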

3.2 Communication in Janus 

To express communication in Janus the user must explicitly describe which 
points belong to the underlying mesh relation. Figure 3 showed the example
of creating the triangle vertex relation.

Note that in Janus the user describes the relation on the level of mesh points, 
in an application-oriented way. When creating a relation the user need not specify 
where the mesh points he refers to are actually stored. This necessary information 
is obtained by the library from the mapping objects of the mapped two-phase 
containers. 

Since the relation itself is stored in a two-phase container it will not change 
during its use. Thus it can be examined before its first use. Analyzing the sparsity 
patterns allows message buffers of the right size to be created in advance, thus 
reducing the communication overhead. This is a very important optimization 
for parallel sparse matrix multiplication, which is a key component of iterative 
methods. They are the preferred methods for the solution of large scale finite 
element problems. 

4 Concluding Remarks 

We have presented the major concepts of a template library for data parallel ad- 
aptive mesh applications. The concept of a two-phase container provides simple 
yet sufficient and efficient support for irregular structures such as finite element 
meshes and sparse matrix patterns. Two-phase containers are beneficial in a se- 
quential and parallel context and serve as a useful base for other concepts such 
as associated containers. Using two-phase containers for the description of mesh
relations allows irregular communication patterns to be analyzed when they are 
created. 

Currently we use a prototype of Janus for the parallel finite element ana- 
lysis on two-dimensional meshes. For the solution of the linear systems we use 
the conjugate gradient method with a simple diagonal preconditioner. In future 
we will incorporate multilevel preconditioners and adaptive refinement into the 
solver. The necessary abstractions are already contained in Janus. 
Abstract. The Blitz++ library provides numeric arrays for C++ with
efficiency that rivals Fortran, without any language extensions. Blitz++ 
has features unavailable in Fortran 90/95, such as arbitrary transpose 
operations, array renaming, tensor notation, partial reductions, multi- 
component arrays and stencil operators. The library handles parsing and 
analysis of array expressions on its own using the expression templates 
technique, and performs optimizations (such as loop transformations) 
which have until now been the responsibility of compilers. 



1 Introduction 

The goal of the Blitz++ library is to provide a solid “base environment” of
arrays, matrices and vectors for scientific computing in C++. This paper focuses
on arrays in Blitz++, which provide performance competitive with Fortran and
superior functionality. The design of Blitz++ has been influenced by Fortran
90, High-Performance Fortran, the Math.h++ library, A++/P++ [2], and
POOMA. It incorporates various features from these environments, and adds
many of its own. This paper concentrates on the unique features of Blitz++
arrays.

2 Overview 

Multidimensional arrays in Blitz++ are provided by the class template Array<T, 
N>. The template parameter T is the numeric type stored in the array, and N is 
its rank. This class supports a variety of array models: 

— Arrays of scalar types, such as Array<int,2> and Array<float,3>

— Complex arrays, such as Array<complex<float>,2> 

— Arrays of user-defined types. For example, if Polynomial is a class defined 
by the user (or another library), Array<Polynomial,2> is a two dimensional 
array of Polynomial objects. 

— Nested homogeneous arrays using the Blitz++ classes TinyVector and 
TinyMatrix. For example, Array<TinyVector<float,3>,3> is a 
three-dimensional vector field. 

— Nested heterogeneous arrays, such as Array<Array<int,1>,1>, in which
each element is an array of variable length. 
2.1 Storage Layout and Reference Counting

Array objects are lightweight views of a separately allocated data block. This
design permits a single block of data to be represented by several array views.
Each array object contains a descriptor (also called a dope vector) which
specifies the memory layout. The descriptor contains a pointer to the array data,
lower bounds for the indices, a shape vector, a stride vector, reversal flags, and
a storage ordering vector. This last is a permutation of the dimension numbers
[1, 2, ..., N] which indicates the order in which dimensions are stored in memory.
Fortran-style column-major arrays correspond to [1, 2, ..., N], and C-style
row-major arrays correspond to [N, N-1, ..., 1]. Reversal flags indicate whether
each dimension is stored in ascending or descending order.

The storage ordering vector and reversal flags allow arrays to be stored in
any one of N! 2^N orderings (N! permutations of the dimensions, each with 2^N
choices of reversal flags). Only two of these, C-style and Fortran-style arrays,
are frequently used. There are occasional uses for other orderings: some image
formats store rows from bottom to top, which can be handled transparently by
a reversal flag.
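How a storage ordering vector and reversal flags translate into strides can be
sketched as below. This is an illustrative reconstruction, not the Blitz++
descriptor: Layout and make_layout are hypothetical names, dimensions are
numbered from 0 rather than from 1 as in the text, and lower index bounds are
assumed to be zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <std::size_t N>
struct Layout {
    std::array<std::ptrdiff_t, N> stride;
    std::ptrdiff_t base;  // memory offset of element (0, ..., 0)

    std::ptrdiff_t offset(const std::array<std::ptrdiff_t, N>& idx) const {
        std::ptrdiff_t off = base;
        for (std::size_t d = 0; d < N; ++d) off += idx[d] * stride[d];
        return off;
    }
};

// ordering[0] names the dimension that varies fastest in memory, so
// {0, 1, ..., N-1} is Fortran style and {N-1, ..., 1, 0} is C style.
template <std::size_t N>
Layout<N> make_layout(const std::array<std::ptrdiff_t, N>& extent,
                      const std::array<std::size_t, N>& ordering,
                      const std::array<bool, N>& reversed) {
    Layout<N> layout{};
    std::ptrdiff_t step = 1;
    for (std::size_t k = 0; k < N; ++k) {
        std::size_t d = ordering[k];
        assert(d < N);
        if (reversed[d]) {
            // A reversed dimension is walked backwards from its last slot.
            layout.stride[d] = -step;
            layout.base += (extent[d] - 1) * step;
        } else {
            layout.stride[d] = step;
        }
        step *= extent[d];
    }
    return layout;
}
```

For a 2x3 array, element (1,0) sits at offset 1 in Fortran order and at offset
3 in C order; reversing the first dimension of the C-ordered array moves
element (0,0) to offset 3 and element (1,0) to offset 0.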

Arrays are reference-counted: the number of arrays referencing a data block is 
monitored, and when no arrays refer to a data block it is deallocated. Reference 
counting provides the benefits of garbage collection, and allows functions to 
return array objects efficiently: 

Array<float,2> someUserFunction(Array<float,2>&);

Reference counting and flexible storage formats support useful O(1) array
operations:

— Arbitrary transpose operations: The dimensions of an array can be permuted
using the transpose(...) member function. This code makes B a shared
view of A, but with the first and second dimensions swapped:

Array<float,3> A(3,3,3); // A 3x3x3 array

Array<float,3> B = A.transpose(secondDim, firstDim, thirdDim);

The integer constants firstDim, secondDim, ... are intended to improve
readability, and to avoid confusion over whether the first dimension is 1 (as in
Fortran) or 0 (as in C).

— Dimension reversals: Each dimension can be independently reversed. If A 
contains a two-dimensional colour image, then 

Array<RGB24,2> B = A.reverse(firstDim);
flips the image vertically. 

— Array relabelling: Since array objects are really lightweight handles, arrays
can be swapped and relabelled in constant time. This is very useful in
time-stepping PDEs. If A1, A2 and A3 represent a field at three consecutive
time steps, cycleArrays(A1,A2,A3) relabels the arrays for the next time step:
[A1,A2,A3] -> [A2,A3,A1]. This avoids costly copying of the array data.

— Array interlacing: Blitz++ allows arrays of the same shape to be interlaced
in memory. Such an arrangement improves data locality, which can increase
performance in some situations.
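The constant-time relabelling described in the list above follows directly from
treating arrays as handles onto reference-counted data. The sketch below shows
the idea with std::shared_ptr; it is an assumption-laden illustration, and the
names Handle and cycle are hypothetical rather than Blitz++ identifiers.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// A lightweight handle onto a shared, reference-counted data block.
// Copying a Handle creates another view of the same block.
class Handle {
public:
    explicit Handle(std::size_t n)
        : data_(std::make_shared<std::vector<float>>(n)) {}
    float& operator[](std::size_t i) { return (*data_)[i]; }
    const float* block() const { return data_->data(); }
    long owners() const { return data_.use_count(); }
    void swap(Handle& other) { data_.swap(other.data_); }

private:
    std::shared_ptr<std::vector<float>> data_;
};

// Relabel [a, b, c] -> [b, c, a] by exchanging handles only.
// No array element is copied, so the cost is independent of the array size.
inline void cycle(Handle& a, Handle& b, Handle& c) {
    a.swap(b);  // a now refers to the old b, b to the old a
    b.swap(c);  // b now refers to the old c, c to the old a
}
```

The data block is released automatically when the last handle referring to it
goes out of scope, which is the behaviour the text attributes to reference
counting.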
2.2 Subarrays and Slicing

Subarrays in Blitz++ are fully functional Array objects which have a shared
view of the array data. Subarrays can either be full rank or lesser rank. Blitz++
supplies Range objects which emulate the Fortran 90 range syntax. Any
combination of Range and integer values can be used to obtain a subarray:

Array<float,3> A(64,64,64); // A 64x64x64 array

// C refers to the 2D slice A(10..63, 15, 0..63)
Array<float,2> C = A(Range(10,toEnd), 15, Range::all());

// D refers to the 1D slice A(0..30, 15, 20)
Array<float,1> D = A(Range(fromStart,30), 15, 20);

The names fromStart and toEnd follow earlier work. An optional third parameter
to the Range constructor specifies a stride, so subarrays do not have to be
contiguous.

3 Array Expressions 

Array expressions in Blitz++ are implemented using the expression templates
technique. Prior to expression templates, use of overloaded operators meant
generating temporary arrays, which caused huge performance losses. In Blitz++,
temporary arrays are never created. Since its original development, the
expression templates technique has grown substantially more complex and
powerful. Its present incarnation in Blitz++ supports a wide variety of useful
notations and optimizations. The next sections give an overview of the main
features of the Blitz++ expression templates implementation from a user
perspective.
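To make the absence of temporaries concrete, here is a deliberately small
expression-template sketch for one-dimensional vectors. It is not the Blitz++
implementation; Expr, Sum, Prod and Vec are hypothetical names, and only
elementwise + and * are supported.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base: marks a type as an expression and recovers its concrete type.
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Parse-tree node for l + r; it stores references, not results.
template <class L, class R>
struct Sum : Expr<Sum<L, R>> {
    const L& l;
    const R& r;
    Sum(const L& a, const R& b) : l(a), r(b) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

// Parse-tree node for l * r (elementwise).
template <class L, class R>
struct Prod : Expr<Prod<L, R>> {
    const L& l;
    const R& r;
    Prod(const L& a, const R& b) : l(a), r(b) {}
    double operator[](std::size_t i) const { return l[i] * r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec : Expr<Vec> {
    std::vector<double> v;
    explicit Vec(std::size_t n, double x = 0.0) : v(n, x) {}
    double operator[](std::size_t i) const { return v[i]; }
    double& operator[](std::size_t i) { return v[i]; }
    std::size_t size() const { return v.size(); }

    // Assignment walks the parse tree once per element: a single loop,
    // with no intermediate vector for any subexpression.
    template <class E>
    Vec& operator=(const Expr<E>& e) {
        const E& x = e.self();
        assert(x.size() == v.size());
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = x[i];
        return *this;
    }
};

template <class L, class R>
Sum<L, R> operator+(const Expr<L>& a, const Expr<R>& b) {
    return Sum<L, R>(a.self(), b.self());
}

template <class L, class R>
Prod<L, R> operator*(const Expr<L>& a, const Expr<R>& b) {
    return Prod<L, R>(a.self(), b.self());
}
```

An assignment such as a = b + c * d builds an object of type
Sum<Vec, Prod<Vec, Vec>> and evaluates it in one pass. The nodes hold
references, so an expression object should be consumed within the statement
that creates it.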

3.1 Operators 

Any operator which is meaningful for the array elements can be applied to arrays. 
For example: 

Array<float,2> A, B, C, D; // ...

A = B + (C * D);

Array<int,1> E, F, G, H; // ...

E |= (F & G) >> H;

Operators are always applied in an elementwise manner. Users can create arrays 
of their own classes, and use whichever overloaded operators they have provided: 

class Polynomial {
    // define operators + and *
};

Array<Polynomial,2> A, B, C, D; // ...

A = B + (C*D); // results in appropriate calls
               // to Polynomial operators
Math functions provided by the standard C++, IEEE and System V math
libraries may be used on arrays, for example sin(A) and lgamma(B).

Arrays with different storage formats can appear in the same expression; for 
example, a user can add a C-style array to a Fortran array. Blitz++ transparently 
corrects for the storage formats. Blitz++ allows arrays of different numeric types 
to be mixed in an expression. Type promotion follows the standard C rules, with 
some modifications to handle complex numbers and user-defined types. 

Blitz++ supplies a set of index placeholder objects which allow array indices
to be used in expressions. This code creates a Hilbert matrix:

Array<float,2> A(4,4);

// i and j are index placeholders
firstIndex i;
secondIndex j;

A = 1.0 / (1+i+j); // Sets A(i,j) = 1/(1+i+j) for all (i,j)



3.2 Tensor Notation 

Blitz++ provides a notation modelled on tensors. Here is an example of mathe- 
matical tensor notation: 



$$C^{ijk} = A^{ij} x^{k} - A^{jk} y^{i}$$

In Blitz++, this equation can be coded as: 

using namespace blitz::tensor;

C = A(i,j) * x(k) - A(j,k) * y(i);

The tensor indices i, j, k, ... are special objects concealed in the namespace
blitz::tensor. Users are free to declare their own tensor indices with different
names if they prefer. Tensor indices specify how arrays are oriented in the domain
of the array receiving the expression (Fig. 1). Any missing tensor indices are
interpreted as spread operations; for example, the A(i,j) term in the above
example is spread over the k index.

Unlike real tensor notation, repeated indices do not imply contraction. For
example, the tensor expression $C^{ij} = A^{ik} B^{kj}$ implies a summation
over k. In Blitz++, contractions must be written explicitly using a partial
reduction (described later):

Array<float,2> A, B, C; // ...

C(i,j) = sum(A(i,k) * B(k,j), k);
Fig. 1. Illustration of Blitz++ tensor notation: the indices specify how arrays are
oriented in the domain of the array receiving the result 



3.3 Stencil Objects and Operators 

Blitz++ provides a stencil object mechanism that removes much of the drudgery 
from writing finite difference equations. One of the Blitz++ example programs is 
a three-dimensional computational fluid dynamics simulation. In each iteration, 
the velocity field is time-stepped according to the equation

$$V \leftarrow V + \Delta t \left( \rho^{-1} \left( \eta \nabla^{2} V - \nabla P + F \right) - A \right)$$

where V, P, F and A are velocity, pressure, force, and advection. Implementing
this equation using 4th-order accurate finite differencing in Fortran requires a set
of mammoth equations with approximately 70 terms. Using a Blitz++ stencil
object, the equation is written as:

nextV = *V + delta_t * (recip_rho * (eta * Laplacian3DVec4(V, geom)
        - grad3D4(P, geom) + *force) - *advect);

The vector fields V, force and advect are implemented as arrays of 3-vectors.
This eliminates the need to represent each vector field as three separate arrays,
common in Fortran implementations. The stencil operators Laplacian3DVec4
and grad3D4 are provided by Blitz++, and implement 4th-order Laplacian and
gradient operators. The Laplacian3DVec4 operator expands into a 45-point
stencil. Blitz++ supplies stencil operators for forward, central and backward
differences of various orders and accuracies; built on top of these are divergence,
gradient, curl, mixed partial, and Laplacian operators.

Blitz++ provides special support for vector fields (and in general,
multicomponent/multispectral arrays). The [] operator is overloaded for easy access
to individual components of a multicomponent array. For example, this code
initializes the force field with gravity:

const int x=0, y=1, z=2;
force[x] = 0.0;
force[y] = 0.0;
force[z] = gravity;
3.4 Reductions

Reductions in Blitz++ transform an N-dimensional array (or array expression) 
to a scalar value: 

Array<int,2> A(4,4); // ...

int result1 = sum(A); // sum all elements

int result2 = count(A == 0); // count zero elements

Available reductions are sum, product, min, max, count, minIndex, maxIndex,
any and all. Partial reductions transform an N-dimensional array (or array
expression) to an N-1 dimensional array expression. The reduction is performed
along a single rank:

Array<int,2> A(2,4);

Array<int,1> B(2);

A = 0, 1, 1, 5,
    3, 0, 0, 0;

B = sum(A, j); // Reduce along rows: B = [ 7 3 ]

Reductions can be chained: for example, this code finds the row with the mini- 
mum sum of squares: 

Array<float,2> A(N,N); // ...

int minRow = minIndex(sum(pow2(A), j));
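For readers unfamiliar with partial reductions, the loops that these one-line
expressions stand for can be written out by hand. The following plain C++
sketch uses hypothetical helper names (sum_all, sum_rows, min_row_by_squares)
and a vector-of-rows matrix; it shows the semantics only and says nothing about
how Blitz++ generates its kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;  // one inner vector per row

// Full reduction: every element contributes to a single scalar.
int sum_all(const Matrix& a) {
    int s = 0;
    for (const std::vector<int>& row : a)
        for (int x : row) s += x;
    return s;
}

// Partial reduction over the second index: one result per row.
std::vector<int> sum_rows(const Matrix& a) {
    std::vector<int> b(a.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (int x : a[i]) b[i] += x;
    return b;
}

// Chained reduction: the index of the row with the smallest sum of squares.
std::size_t min_row_by_squares(const Matrix& a) {
    assert(!a.empty());
    std::size_t best = 0;
    long best_sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        long s = 0;
        for (int x : a[i]) s += static_cast<long>(x) * x;
        if (i == 0 || s < best_sum) {
            best = i;
            best_sum = s;
        }
    }
    return best;
}
```

Applied to the 2x4 array used in the partial reduction example, sum_rows
yields the values 7 and 3.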

4 Optimizations 

The expression templates technique allows Blitz++ to parse array expressions
and generate customized evaluation kernels at compile time. To achieve good
performance, Blitz++ performs many loop transformations which have
traditionally been the responsibility of optimizing compilers:

— Loop interchange and reversal: Consider this bit of code, which is a naive 
implementation of the array operation A = B + C: 

for (int i=0; i < N1; ++i)
  for (int j=0; j < N2; ++j)
    for (int k=0; k < N3; ++k)
      A(i,j,k) = B(i,j,k) + C(i,j,k);

The layout of these arrays in memory is unknown at compile time. If the
arrays are stored in column-major order, this code will be very inefficient
because of poor data locality. For large arrays, an entire cache line would
have to be loaded for each element access. To avoid this problem, Blitz++
selects a traversal order at run-time such that the arrays are traversed in
memory-storage order.
— Hoisting stride calculations: The inner loop of the above code fragment would
expand to contain many stride calculations. Blitz++ generates code which
hoists the invariant portion of the stride arithmetic out of the innermost
loop.

— Collapsing inner loops: Suppose that in the above code fragment, N3 is quite
small. Loop overhead and pipeline effects will conspire to cause poor
performance. The solution is to convert the three nested loops into a single loop.
At runtime, Blitz++ collapses the inner loops whenever possible.

— Partial unrolling: Many compilers partially unroll inner loops to expose
low-level parallelism. For compilers that won’t, Blitz++ does this unrolling itself.

— Common stride optimizations: Blitz++ tests at run-time to see if all the
arrays in the expression have a unit or common stride. If so, faster evaluation
kernels are used.

— Tiling: Blitz++ detects the presence of stencils, and does tiling to ensure
good cache use.
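Of the transformations listed above, loop collapsing is the easiest to show in
isolation. The sketch below assumes three arrays that share one contiguous
layout, in which case the triple loop for A = B + C reduces to a single
unit-stride loop. Array3 and add_collapsed are hypothetical names, and the
run-time test that decides whether collapsing is legal is reduced here to a
size check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal rank-3 array with contiguous row-major (C-style) storage.
struct Array3 {
    std::size_t n1, n2, n3;
    std::vector<double> data;

    Array3(std::size_t a, std::size_t b, std::size_t c, double fill = 0.0)
        : n1(a), n2(b), n3(c), data(a * b * c, fill) {}

    double& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data[(i * n2 + j) * n3 + k];
    }
};

// A = B + C with the three nested loops collapsed into one.
// Valid only because all operands are contiguous and share the same layout.
void add_collapsed(Array3& a, const Array3& b, const Array3& c) {
    assert(a.data.size() == b.data.size() && a.data.size() == c.data.size());
    const std::size_t n = a.data.size();
    double* pa = a.data.data();
    const double* pb = b.data.data();
    const double* pc = c.data.data();
    // One loop, unit stride, no per-dimension index arithmetic.
    for (std::size_t m = 0; m < n; ++m) pa[m] = pb[m] + pc[m];
}
```

The single loop also traverses memory in storage order regardless of how small
the innermost extent is, which is the situation the collapsing item describes.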



4.1 Benchmark Results 

Figure 2 shows performance of the Blitz++ classes Array and Vector for a
DAXPY operation on the Cray T3E-900 (single PE) using KAI C++. The Blitz++
classes achieve the same performance as Fortran 90.^1 The native BLAS library
is able to outperform both Fortran 90 and Blitz++.^2 Without expression
templates, performance is typically 30% that of Fortran.

Table 1 shows performance of Blitz++ arrays on 21 loop kernels used by IBM
for benchmarking the RS/6000. Performance is reported as a percentage of Fortran
performance: > 100 is faster, and < 100 is slower. The fastest native Fortran
compiler was used, with typical optimization switches (-O3, -Ofast). The loop
kernels and makefiles are available as part of the Blitz++ distribution.



Table 1. Performance of Blitz++ on 21 loop kernels, relative to Fortran

Platform/Compiler      Out of cache         In-cache (peak)
                       Median    Mean       Median    Mean
Cray T3E/KCC            95.7%    86.4%       98.1%    88.4%
HPC-160/KCC            100.2%    97.5%       95.1%    93.4%
Origin 2000/KCC         88.1%    87.3%       79.8%    78.6%
Pentium II/egcs         98.4%    98.5%       79.6%    82.6%
RS 6000/KCC             93.5%    90.7%       97.3%    93.2%
UltraSPARC/KCC          91.1%    86.8%       79.0%    78.3%



^1 Fortran 77 is no longer supported on the T3E, and is actually slower. The flags
used for the f90 compiler were -O3,aggress,unroll2,pipelines.

^2 Although not yet implemented, it is possible to do pattern matching to native
BLAS using expression templates, an idea due to Roldan Pozo.

Fig. 2. DAXPY benchmark on the Cray T3E (single PE)
Abstract. POOMA is a templated C++ class library for use in the
development of large-scale scientific simulations on serial and parallel 
computers. POOMA II is a new design and implementation of POOMA 
intended to add richer capabilities and greater flexibility to the frame- 
work. The new design employs a generic Array class that acts as an 
interface to, or view on, a wide variety of data representation objects 
referred to as engines. This design separates the interface and the repre- 
sentation of multidimensional arrays. The separation is achieved using 
compile-time techniques rather than virtual functions, and thus code ef- 
ficiency is maintained. POOMA II uses PETE, the Portable Expression 
Template Engine, to efficiently represent complex mathematical expressi- 
ons involving arrays and other objects. The representation of expressions 
is kept separate from expression evaluation, allowing the use of multiple 
evaluator mechanisms that can support nested where-block constructs, 
hardware-specific optimizations and different run-time environments. 



1 Introduction 

Scientific software developers have struggled with the need to express
mathematical abstractions in an elegant and maintainable way without sacrificing
performance. The POOMA (Parallel Object-Oriented Methods and Applications)
framework, written in ANSI/ISO C++, has demonstrated both high expressiveness
and high performance for large-scale scientific applications on platforms
ranging from workstations to massively parallel supercomputers. POOMA
provides high-level abstractions for multidimensional arrays, physical meshes,
mathematical fields, and sets of particles. POOMA also exploits techniques such
as expression templates [2] to optimize serial performance while encapsulating
the details of parallel communication and supporting block-based data compres- 
sion. Consequently, scientists can quickly assemble parallel simulation codes by 
focusing directly on the physical abstractions relevant to the system under study 
and not the technical difficulties of parallel communication and machine-specific 
optimization. 

POOMA II is a complete rewrite of POOMA intended to further increase
expressiveness and performance. The array and field concepts have been redesigned
to use a powerful and flexible view-based architecture that decouples interface
and representation. Expressions involving arrays and fields are packaged and
manipulated using an enhanced version of PETE, the Portable Expression
Template Engine. These expressions can operate on subsets of the data, specified
via multidimensional domain objects. Finally, the expressions are efficiently
evaluated by evaluator objects. These evaluators support a variety of run-time
systems, ranging from immediate serial evaluation to thread-based parallel
evaluation, as well as complex constructs like where-blocks.

* This work was performed under the auspices of the U.S. Department of Energy by
Los Alamos National Laboratory under Contract No. W-7405-Eng-36.



2 Arrays and Engines 

An array is a logically rectilinear, N-dimensional table of numeric elements. Most 
array implementations store their data in a contiguous block of memory and 
apply Fortran or C conventions for interpreting this data as a multidimensional 
array. Unfortunately, these two storage conventions do not span the full range of 
array types encountered in scientific computing: diagonal, banded, symmetric, 
sparse, etc. One can even imagine arrays that use no storage, computing their 
element values as functions of their indices or via expressions involving other 
arrays. One approach to dealing with differing array storage strategies is to 
simply create new array classes for each case: BandedArray, SparseArray, and 
so on. However, this is wasteful since all of these variants have very similar 
interfaces. 

POOMA II's array class provides a uniform interface independent of how
the data is stored or computed, without incurring the overhead of C++ virtual
function calls. This is accomplished by introducing the concept of an engine. An
engine is an object that provides a common interface for randomly accessing and
changing elements without the need for the user of the engine to know how the
elements are stored. For example, an engine that manages a 100 x 200 “brick”
of double-precision values is declared as:

Engine<2, double, Brick> brick(100, 200);

The domain of this engine is the tensor product of [0 . . . 99] by [0 . . . 199]. Si- 
milarly, an engine that manages a brick of data distributed across a parallel 
machine in a manner specified by an object layout is declared as: 

Engine<2, double, Distributed> dbrick(100, 200, layout); 

The domain and range of dbrick are identical to those of brick, as is the interface
for accessing elements. However, the implementations are quite different.

Note that engine classes are all specializations of a common template, Engine.
A tag is used to specify a particular engine, such as Brick or Distributed,
allowing useful default template parameters to be chosen for the array class.

Engines represent a low-level abstraction: getting single elements from a data 
source. The POOMA II array facility provides an efficient, high-level interface 
to engines. POOMA II arrays are declared as follows: 
Array<2, double, Brick> A(100, 200);

Array<2, double, Distributed> B(100, 200, layout); 

This is a variant of the envelope-letter idiom: Array (the envelope) delegates
all operations to the particular sort of engine (the letter) that it contains.
However, compile-time polymorphism, rather than run-time polymorphism, is used
for faster performance. In POOMA II, the engines own the data and arrays simply
provide an interface for viewing and manipulating that data. In this sense they
have semantics similar to iterators in the Standard Template Library, except
that they automatically dereference themselves. To enforce const correctness,
POOMA II provides a ConstArray class (similar to the STL const_iterator)
that prohibits modification of its elements.
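The combination of tag-selected engines and a forwarding envelope can be
sketched in a few lines. This is a simplified illustration, not POOMA II
source: the template parameter order differs from the paper's, the Diagonal tag
and the names Array2, read, write and total are hypothetical, and only rank-2
arrays are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Brick {};     // tag: every element is stored
struct Diagonal {};  // tag (hypothetical): only the diagonal is stored

template <class T, class Tag> class Engine;  // specialized once per tag

template <class T>
class Engine<T, Brick> {
public:
    Engine(std::size_t rows, std::size_t cols)
        : cols_(cols), data_(rows * cols, T()) {}
    T read(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }
    void write(std::size_t i, std::size_t j, T v) { data_[i * cols_ + j] = v; }

private:
    std::size_t cols_;
    std::vector<T> data_;
};

template <class T>
class Engine<T, Diagonal> {
public:
    Engine(std::size_t rows, std::size_t cols)
        : diag_(rows < cols ? rows : cols, T()) {}
    T read(std::size_t i, std::size_t j) const { return i == j ? diag_[i] : T(); }
    void write(std::size_t i, std::size_t j, T v) {
        assert(i == j);  // off-diagonal elements are implicitly zero
        diag_[i] = v;
    }

private:
    std::vector<T> diag_;
};

// The envelope: one interface for every engine, bound at compile time,
// so no virtual function call is involved.
template <class T, class Tag>
class Array2 {
public:
    Array2(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), engine_(rows, cols) {}
    T operator()(std::size_t i, std::size_t j) const { return engine_.read(i, j); }
    void set(std::size_t i, std::size_t j, T v) { engine_.write(i, j, v); }
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    Engine<T, Tag> engine_;
};

// Generic code is written once against the envelope.
template <class T, class Tag>
T total(const Array2<T, Tag>& a) {
    T s = T();
    for (std::size_t i = 0; i < a.rows(); ++i)
        for (std::size_t j = 0; j < a.cols(); ++j) s += a(i, j);
    return s;
}
```

Adding a new storage scheme then means adding one tag and one Engine
specialization; Array2 and total stay unchanged.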

3 Domains and Views 

Domain objects represent the region or set of points on which an array will define 
values. An N-dimensional domain is composed of N one-dimensional domains 
and represents the tensor product of these domains. POOMA II includes several 
domain classes: 

— Loc<N>: A single point in N-dimensional space. 

— Interval<N>: The tensor product of N one-dimensional sequences each ha- 
ving unit stride. 

— Range<N>: Similar to Interval<N>, with strides specified at run time. 

— Index<N>: Similar to Range<N>, but with special loop-ordering semantics
(see below). 

— Region<N>: Tensor product of N one-dimensional continuous domains. 

Users choose the domain type that best expresses any constraints that they 
wish to impose on the domain. For example, Interval is used for unit-stride
domains and Loc is used for single-point domains. This allows POOMA II to 
infer properties of the domain at compile time and optimize code accordingly. 

One of the primary uses of domains is to specify subsections of Array objects. 
Subarrays are a common feature of array classes; however, it is often difficult 
to make such subarrays behave like first-class objects. The POOMA II engine 
concept provides a clean solution to this problem: subsetting an Array with a 
domain object creates a new Array that has a view engine. For example: 

Interval<1> I(10); // I = {0, 1, ..., 9}

Array<1, double, Brick> A(I);

Range<1> J(0,8,2); // J = {0, 2, ..., 8}

Array<1, double, BrickView> B = A(J);

The new array B is a view of the even elements of A: {A(0), A(2), ..., A(8)}.
Note that views always act as references (i.e., B(0) is an alias for A(0), B(1) is an
alias for A(2), etc.). The task of determining the type of view engine to use when
subsetting an Array is handled by the NewEngine traits class. Specializations of
the class template NewEngine define a trait Type_t that specifies the type of
engine that is created when a particular engine type is subsetted by a particular
domain type. Thus, in the above example we could have written:

typedef
NewEngine< Engine<1, double, Brick>, Range<1> >::Type_t View_t;

Array<1, double, View_t::Tag_t> B = A(J);

While users can explicitly declare view-engine-based array objects in the 
manner above, these views will usually be created as temporaries via subscripting 
and then used in expressions and function calls to specify the elements on which 
to operate. For example: 

Interval<1> I(10), I2(2,5); // I2 = {2, 3, 4, 5}

Array<1, double, Brick> A(I), C(I);

C(I2) = A(I2+1) - A(I2-1); // C(2) = A(3) - A(1), etc.

The final expression builds three temporary views and then executes the expres- 
sion on these views. 
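The two ingredients of this section, views that alias the original data and a
traits class that names the resulting engine type, can be sketched for the
one-dimensional case. The code is an illustration under assumptions: Range1 and
subset are hypothetical names, the engine templates are reduced to the bare
minimum, and the range bounds are taken to be inclusive as in the comments of
the listings above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Brick {};      // tag: engine that owns a dense block
struct BrickView {};  // tag: engine that refers to part of a Brick engine

// Strided one-dimensional domain {first, first + stride, ...} up to last.
struct Range1 {
    std::size_t first, last, stride;
    std::size_t size() const { return (last - first) / stride + 1; }
};

template <class T, class Tag> class Engine;

template <class T>
class Engine<T, Brick> {
public:
    explicit Engine(std::size_t n) : data_(n, T()) {}
    T& operator()(std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

private:
    std::vector<T> data_;
};

template <class T>
class Engine<T, BrickView> {
public:
    Engine(Engine<T, Brick>& base, const Range1& r) : base_(&base), r_(r) {
        assert(r.stride > 0 && r.first <= r.last && r.last < base.size());
    }
    // Element i of the view aliases element first + i * stride of the base.
    T& operator()(std::size_t i) { return (*base_)(r_.first + i * r_.stride); }
    std::size_t size() const { return r_.size(); }

private:
    Engine<T, Brick>* base_;
    Range1 r_;
};

// Trait: which engine results from subsetting engine E with domain D?
template <class E, class D> struct NewEngine;

template <class T>
struct NewEngine<Engine<T, Brick>, Range1> {
    typedef Engine<T, BrickView> Type_t;
};

// Subsetting returns whatever engine type the trait prescribes.
template <class T>
typename NewEngine<Engine<T, Brick>, Range1>::Type_t
subset(Engine<T, Brick>& e, const Range1& r) {
    return typename NewEngine<Engine<T, Brick>, Range1>::Type_t(e, r);
}
```

Writing through the view changes the underlying brick, which is the reference
behaviour described for B and A above.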

In multidimensional cases, there can be multiple interpretations of certain
expressions involving views of arrays. For example, if I and J are domain objects,
then what does A(I,J) = B(J,I) mean? If I and J are Interval objects of equal
length, then this would be an element-wise assignment. However, POOMA II's
Index<N> domain objects have knowledge of their loop ordering. If these domain
objects are used, then A(I,J) = B(J,I) assigns the transpose of B to A. Thus,
the user can choose between tensor-like subscript semantics and Fortran 90 array
semantics simply by choosing different domain types.

4 Expressions and Evaluators 

Most of the computation in a POOMA II code takes place in mathematical 
expressions involving several arrays. Expression templates and template meta- 
programs [?] are used to support an expressive syntax and to implement a num- 
ber of compile-time optimizations. The most common of these optimizations is 
converting these high-level expressions into efficient low-level loops. 

Expression templates work by storing the parse tree of an expression with
operator objects at non-leaf nodes and data objects at the leaves. An expression
object is templated on a type that encodes the structure of the parse tree so
that the parse tree can be manipulated at compile time to produce efficient
code. Consider the sample expression parse tree shown in Fig. 1. PETE encodes
this parse tree in an object of type:

TBTree< OpAssign, Array1,
  TBTree< OpPlus, ConstArray2,
    TBTree< OpMultiply, Scalar<int>, ConstArray3 > > >

containing references to arrays A, B and C, and the scalar 2. This expression
object can be used to generate an optimized set of loops. However, it does not
have array semantics and is not an Array, so it cannot be passed to functions
expecting an Array.

Fig. 1. Parse tree for the expression A = B + 2 * C

The POOMA II engine architecture provides a solution to this problem: 
the expression engine. An expression engine wraps a PETE expression with an 
engine interface. Values of an expression engine are computed efficiently by the 
expression-template machinery based on the data referred to in an expression 
object. With this innovation, the result of an expression involving Array objects 
is an Array. Thus, users can write functions that operate on expressions by 
templating them for arbitrary engine types. For example, 

template<int Dim, class T, class ET>
T trace(const ConstArray<Dim,T,ET> &a) {
    T tr = 0;
    for (int i = 0; i < a.length(0); ++i) { tr += a(i,i); }
    return tr;
}

Then trace(B+2*C) sums the diagonal components of B+2*C without computing
any of the off-diagonal values. 
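The claim that only diagonal values are computed can be checked with a small
stand-in for an expression engine. In this sketch, which is not PETE and not
POOMA II code, the class ScaledSum plays the role of the expression B + s * C
and counts how many elements it has been asked to evaluate; Matrix and
ScaledSum are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Matrix {
    std::size_t n;
    std::vector<double> data;  // n x n, row-major

    Matrix(std::size_t size, double fill) : n(size), data(size * size, fill) {}
    double operator()(std::size_t i, std::size_t j) const { return data[i * n + j]; }
    std::size_t length(int) const { return n; }
};

// Stand-in for an expression engine holding the parse tree of b + s * c.
// Nothing is computed until a single element is requested.
class ScaledSum {
public:
    ScaledSum(const Matrix& b, double s, const Matrix& c)
        : b_(b), s_(s), c_(c), evaluations_(0) {}

    double operator()(std::size_t i, std::size_t j) const {
        ++evaluations_;
        return b_(i, j) + s_ * c_(i, j);
    }
    std::size_t length(int d) const { return b_.length(d); }
    std::size_t evaluations() const { return evaluations_; }

private:
    const Matrix& b_;
    double s_;
    const Matrix& c_;
    mutable std::size_t evaluations_;  // bookkeeping for this demonstration only
};

// Works for stored matrices and for lazy expressions alike.
template <class A>
double trace(const A& a) {
    double tr = 0.0;
    for (std::size_t i = 0; i < a.length(0); ++i) tr += a(i, i);
    return tr;
}
```

For 3x3 operands, trace triggers exactly three element evaluations of the
expression; the six off-diagonal values are never formed.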

Expression evaluation is a separate component from the array and expression 
objects. Evaluators only require a few basic services from arrays and expressions: 
subsetting, returning an element, getting a domain, etc. Any object that can 
use those services to evaluate expressions qualifies as an Evaluator. Expression 
evaluation is triggered by the assignment operator of Array, which builds a 
new Array that has an expression engine and hands it off to an Evaluator. 
Each expression is defined on a domain, and the Evaluator invokes a function 
specialized on the type of the domain to evaluate the expression at each point. 

For example, suppose an expression is defined on a domain that has only 
STL-style iterators for looping over the domain. Then, if the domain object is 
dom and the expression-array object is expr, the inner evaluation loop could look 
like 

for (dom::iterator dp = dom.begin(); dp != dom.end(); ++dp)
  expr(*dp);

If the domain is a two-dimensional Interval, for which we know that the strides 
are all unity, the inner loops would look like 

for (int j = 0; j < dom[1].length(); ++j)
  for (int i = 0; i < dom[0].length(); ++i)
    expr(i, j);

The type of inner loop can be determined at compile time since it depends on 
the type of the domain. That allows the most specialized — and therefore the 
most efficient — code to be used for the provided data structures. 
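
One way to obtain this compile-time selection is ordinary function-template overloading, sketched below. The types Interval2, PointList and CountVisits and the LoopKind return value are hypothetical and exist only to make the dispatch observable; the actual Evaluator interface may differ.

```cpp
#include <cassert>
#include <vector>

// Reports which loop form was selected (for illustration only).
enum class LoopKind { Iterator, NestedUnitStride };

struct Point { int i, j; };

// Hypothetical domain that offers only STL-style iteration over its points.
struct PointList {
    typedef std::vector<Point>::const_iterator iterator;
    std::vector<Point> points;
    iterator begin() const { return points.begin(); }
    iterator end() const { return points.end(); }
};

// Hypothetical two-dimensional interval with unit strides.
struct Interval2 { int length0, length1; };

// Generic fallback: visit each point through the domain's iterator.
template <class Expr, class Domain>
LoopKind evaluate(Expr& expr, const Domain& dom) {
    for (typename Domain::iterator dp = dom.begin(); dp != dom.end(); ++dp)
        expr(dp->i, dp->j);
    return LoopKind::Iterator;
}

// More specialized overload, chosen at compile time for Interval2 domains.
template <class Expr>
LoopKind evaluate(Expr& expr, const Interval2& dom) {
    assert(dom.length0 >= 0 && dom.length1 >= 0);
    for (int j = 0; j < dom.length1; ++j)
        for (int i = 0; i < dom.length0; ++i)
            expr(i, j);
    return LoopKind::NestedUnitStride;
}

// Stand-in for an expression: counts how many points were evaluated.
struct CountVisits {
    int count = 0;
    void operator()(int, int) { ++count; }
};
```

Because overload resolution depends only on the static type of the domain, no run-time test is needed to pick the loop form.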

The Evaluator classes also provide a where-block interface, enabling code
such as

where(A < 1);
  B = A;
elsewhere();
  B = 1 - A;
endwhere();

This code sets array B to A wherever A is less than 1 and 1 - A otherwise. Each
call to where(), elsewhere() and endwhere() manipulates state information
in the evaluator that influences how expressions are evaluated.

One way to store this state is as a boolean mask array. Because where-blocks 
can be nested, there must be a stack of such masks, and the top of the stack 
is the mask for the currently active where-block. Alternatively, one can store a 
one-dimensional vector of discrete points where the expression is to be evaluated. 
This would be more efficient than the boolean mask if a small fraction of the 
mask is true. In either case, the Evaluator extracts an evaluation domain from 
the where-block expression and evaluates the expression at each point. 
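
The boolean-mask variant can be sketched as follows. The class WhereState and the helper whereExample are hypothetical; the sketch assumes one-dimensional data and makes the masked assignments explicit rather than overloading operator=.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<bool> Mask;

// Hypothetical evaluator state: a stack of masks, one frame per open where-block.
class WhereState {
    struct Frame { Mask parent, current; };
public:
    explicit WhereState(std::size_t n) : n_(n) {}

    // where(cond): restrict the enclosing mask to points where cond holds.
    void where(const Mask& cond) {
        assert(cond.size() == n_);
        Frame f;
        f.parent = active();
        f.current = Mask(n_);
        for (std::size_t i = 0; i < n_; ++i) f.current[i] = f.parent[i] && cond[i];
        stack_.push_back(f);
    }
    // elsewhere(): switch to the enclosing points where cond does not hold.
    void elsewhere() {
        assert(!stack_.empty());
        Frame& f = stack_.back();
        for (std::size_t i = 0; i < n_; ++i) f.current[i] = f.parent[i] && !f.current[i];
    }
    // endwhere(): leave the innermost where-block.
    void endwhere() {
        assert(!stack_.empty());
        stack_.pop_back();
    }
    // Mask of the currently active block; all true outside any where-block.
    Mask active() const {
        return stack_.empty() ? Mask(n_, true) : stack_.back().current;
    }
private:
    std::size_t n_;
    std::vector<Frame> stack_;
};

// The where-block from the text, written out for a 1-D array A.
inline std::vector<double> whereExample(const std::vector<double>& A) {
    const std::size_t n = A.size();
    std::vector<double> B(n, 0.0);
    Mask cond(n);
    for (std::size_t i = 0; i < n; ++i) cond[i] = A[i] < 1.0;

    WhereState state(n);
    state.where(cond);
    Mask m = state.active();
    for (std::size_t i = 0; i < n; ++i) if (m[i]) B[i] = A[i];
    state.elsewhere();
    m = state.active();
    for (std::size_t i = 0; i < n; ++i) if (m[i]) B[i] = 1.0 - A[i];
    state.endwhere();
    return B;
}
```

Each frame keeps the mask of its enclosing block so that elsewhere() in a nested block stays confined to the points selected by the outer block.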

The evaluator system is designed to be extensible. Several key extensions are 
now under development. 

Multiblock. Multiblock arrays decompose their data into multiple blocks. The 
evaluator intersects the subdomains of a multiblock expression, subsets the ex- 
pression with the intersections, and then evaluates the expression on each sub- 
domain. 
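
As a one-dimensional illustration of the intersection step, the sketch below intersects two block layouts. The Interval type and the function intersectLayouts are hypothetical, and half-open index ranges are assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Half-open index range [lo, hi).
struct Interval { int lo, hi; };

// Intersect every block of layout a with every block of layout b and keep
// the non-empty pieces. Each piece lies inside a single block of both layouts.
inline std::vector<Interval> intersectLayouts(const std::vector<Interval>& a,
                                              const std::vector<Interval>& b) {
    std::vector<Interval> pieces;
    for (std::size_t x = 0; x < a.size(); ++x) {
        for (std::size_t y = 0; y < b.size(); ++y) {
            assert(a[x].lo <= a[x].hi && b[y].lo <= b[y].hi);
            const int lo = std::max(a[x].lo, b[y].lo);
            const int hi = std::min(a[x].hi, b[y].hi);
            if (lo < hi) pieces.push_back(Interval{lo, hi});
        }
    }
    return pieces;
}
```

An expression whose operands use the two layouts can then be evaluated piece by piece, with every operand contiguous within each piece.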

SPMD Parallel. In an SPMD parallel environment, the evaluator employs an 
algorithm such as owner-computes to decide what part of the whole domain 
should be evaluated on the local processor. It then takes a local view of the 
expression on that domain. If arrays in the expression have remote data, they 
must transfer their remote data in order to provide a local view. Once this view 
is constructed, it can be evaluated efficiently. 
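
A minimal sketch of the owner-computes decision, assuming a simple one-dimensional block distribution of the left-hand-side array. The function ownedRange is hypothetical, and real layouts may be multidimensional or irregular.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Half-open index range [lo, hi) owned by one processor.
struct Range { std::size_t lo, hi; };

// Block distribution of n indices over nprocs processors: the first
// (n % nprocs) ranks receive one extra index. Under owner-computes, each
// rank evaluates the expression only on the range it owns.
inline Range ownedRange(std::size_t n, std::size_t rank, std::size_t nprocs) {
    assert(nprocs > 0 && rank < nprocs);
    const std::size_t base = n / nprocs;
    const std::size_t extra = n % nprocs;
    const std::size_t lo = rank * base + std::min(rank, extra);
    const std::size_t hi = lo + base + (rank < extra ? 1 : 0);
    return Range{lo, hi};
}
```

For example, 10 indices on 3 processors yield the ranges [0, 4), [4, 7) and [7, 10).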

Advanced Optimizations. When an expression is ready for evaluation, it need 
not be evaluated immediately so long as there is a mechanism to account for 
data dependencies. There are two important reasons for deferred evaluation: 

— Cache optimization. A given calculation often involves a series of statements 
that use particular arrays multiple times, but each array is too large to fit in 
cache. In that case, it is more efficient to block each statement and evaluate 
one block for a series of statements before working on the next block. 

— Overlapping communication and computation. Typically the parts of a sta- 
tement that require communication are along the boundaries of the domain 
for a given processor. Computation in the interior can proceed while com- 
munication needed for the boundaries is taking place. 
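
The cache-optimization case can be sketched as a queue of deferred statements that is drained block by block. The Statement type and runBlocked are hypothetical, and the sketch assumes purely pointwise statements, so that evaluating a block of one statement never needs values outside that block; stencil statements would require the dependency mechanism mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// A deferred statement evaluates itself on the index range [lo, hi).
typedef std::function<void(std::size_t, std::size_t)> Statement;

// Run every queued statement on one block before moving to the next block,
// so the data touched by a block can stay in cache across statements.
inline void runBlocked(const std::vector<Statement>& stmts, std::size_t n, std::size_t block) {
    assert(block > 0);
    for (std::size_t lo = 0; lo < n; lo += block) {
        const std::size_t hi = std::min(lo + block, n);
        for (std::size_t s = 0; s < stmts.size(); ++s) stmts[s](lo, hi);
    }
}

// Two dependent pointwise statements: B = 2 * A, then C = B + A.
inline std::vector<double> blockedExample(std::size_t n, std::size_t block) {
    std::vector<double> A(n, 1.0), B(n, 0.0), C(n, 0.0);
    std::vector<Statement> queue;
    queue.push_back([&](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i) B[i] = 2.0 * A[i];
    });
    queue.push_back([&](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i) C[i] = B[i] + A[i];
    });
    runBlocked(queue, n, block);
    return C;
}
```

The result is identical to evaluating the statements one after another over the whole domain; only the traversal order changes.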

5 Performance 

In order to illustrate performance characteristics of POOMA II, we present
sample results using a stencil benchmark code. A stencil expression is an array
expression that involves the same array object evaluated at several nearby points.
Such expressions occur frequently in the numerical approximation of partial
differential equations, and thus it is important that they be evaluated
efficiently.

Consider the simple stencil expression 
B(I) = K * (A(I+1) - A(I-1))

To produce optimal code, the compiler must know that B is not aliased to
anything on the right-hand side. It can then keep &B, &A and K in registers and
unroll the loop so that it can save A(I+1) in a register and reuse this value
when it needs A(I-1). Accomplishing these optimizations is non-trivial. First,
the compiler must be told, via the restrict keyword, that B is not aliased.
Second, the compiler must be able to see that both occurrences of A refer to the
same array. This is not guaranteed with expression templates, since the pointer
to the array being operated on is buried in a TBTree node. Failure to realize
this will not only prevent loop unrolling, but also result in the use of extra
registers. For large stencils, the compiler may run out of registers ("register
spillage"), which greatly degrades performance [?].

These problems can be overcome by encapsulating the stencil operation in 
a class. A stencil object calculates the value of the stencil given an array and 
an index. POOMA II stencil objects are fully integrated with the expression- 
template machinery. 
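
A minimal sketch of such a stencil object for the one-dimensional expression above. The class CenteredDifference and the driver applyInterior are hypothetical and are not the POOMA II stencil interface; the point is only that the array is passed once, so both reads visibly refer to the same object.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stencil object for B(I) = K * (A(I+1) - A(I-1)).
class CenteredDifference {
public:
    explicit CenteredDifference(double k) : k_(k) {}

    // Value of the stencil at index i of array a; i must be an interior index.
    template <class Array>
    double operator()(const Array& a, std::size_t i) const {
        assert(i >= 1 && i + 1 < a.size());
        return k_ * (a[i + 1] - a[i - 1]);
    }
private:
    double k_;
};

// Apply a stencil at every interior point, leaving the boundary values of b untouched.
template <class Stencil>
void applyInterior(const Stencil& s, const std::vector<double>& a, std::vector<double>& b) {
    assert(a.size() == b.size());
    for (std::size_t i = 1; i + 1 < a.size(); ++i) b[i] = s(a, i);
}
```

With a = {0, 1, 4, 9, 16} and K = 0.5, the interior of b becomes {2, 4, 6}.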

Our stencil benchmark compares four approaches to the evaluation of the 
9-point stencil 

B(I,J) = c * ( A(I-1,J-1) + A(I,J-1) + A(I+1,J-1) + 

A(I-1,J) + A(I,J) + A(I+1,J) + 

A(I-1,J+1) + A(I,J+1) + A(I+1,J+1) ); 

The evaluation methods are C code with restrict, C-style code using POOMA II
arrays (C++ indexing), POOMA II code using expression templates (POOMA II
Unoptimized), and POOMA II using stencil objects (POOMA II).

The benchmark was performed on an SGI Origin 2000 with 32 KB of primary
cache, 4 MB of secondary cache, and a theoretical peak performance of
400 MFlops. Figure 2 shows the results for the four evaluation techniques using
N × N arrays, where N ranges from 10 to 1000. The C code runs significantly
faster than all the C++ versions because it exploits the restrict keyword.
For N > 40, the arrays are larger than primary cache, but there is little effect
on performance. For N > 400, the arrays are larger than secondary cache, which
leads to a large speed reduction. As the curves for the POOMA II unoptimized
and stencil-object versions demonstrate, there is non-zero overhead in the
expression-template machinery for small N. The advantage of the stencil-object
approach over the unoptimized approach is clearly visible for N < 100. This
advantage does not persist for large N because the loop is not unrolled (no
restrict) and the stencil is not sufficiently large to cause register spillage.
An important result is that the stencil-object version performs almost
identically to the C++ indexing version for N > 30. Once restrict is fully
supported for C++, the performance of stencils implemented using POOMA II
should closely approach that of C.

Fig. 2. Stencil benchmark results. Horizontal axis: problem size, N. Curves:
C with restrict, C++ Indexing, POOMA II (Unoptimized), POOMA II.